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Learning Mixtures of Gaussians in High Dimensions 


Rong Ge* Qingqing Huang' Sham M. Kakade* 


Abstract 

Efficiently learning mixture of Gaussians is a fundamental problem in statistics and learning 
theory. Given samples coming from a random one out of k Gaussian distributions in R n , the 
learning problem asks to estimate the means and the covariance matrices of these Gaussians. 
This learning problem arises in many areas ranging from the natural sciences to the social 
sciences, and has also found many machine learning applications. 

Unfortunately, learning mixture of Gaussians is an information theoretically hard problem: 
in order to learn the parameters up to a reasonable accuracy, the number of samples required 
is exponential in the number of Gaussian components in the worst case. In this work, we show 
that provided we are in high enough dimensions, the class of Gaussian mixtures is learnable in 
its most general form under a smoothed analysis framework, where the parameters are randomly 
perturbed from an adversarial starting point. 

In particular, given samples from a mixture of Gaussians with randomly perturbed parame¬ 
ters, when n > U(/c 2 ), we give an algorithm that learns the parameters with polynomial running 
time and using polynomial number of samples. 

The central algorithmic ideas consist of new ways to decompose the moment tensor of the 
Gaussian mixture by exploiting its structural properties. The symmetries of this tensor are 
derived from the combinatorial structure of higher order moments of Gaussian distributions 
(sometimes referred to as Isserlis’ theorem or Wick’s theorem). We also develop new tools for 
bounding smallest singular values of structured random matrices, which could be useful in other 
smoothed analysis settings. 
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1 Introduction 


Learning mixtures of Gaussians is a fundamental problem in statistics and learning theory, whose 
study dates back to Pearson (1894). Gaussian mixture models arise in numerous areas including 


physics, biology and the social sciences (McLachlan and Peel (2004); Titterington et al. (1985)), as 
well as in image processing (Reynolds and Rose (1995)) and speech (Permuter et al. (2003)). 

In a Gaussian mixture model, there are k unknown ?i-dimensional multivariate Gaussian distri¬ 
butions. Samples are generated by first picking one of the k Gaussians, then drawing a sample from 
that Gaussian distribution. Given samples from the mixture distribution, our goal is to estimate 
the means and covariance matrices of these underlying Gaussian distribution^] 

This problem has a long history in theoretical computer science. The seminal work of Dasgupta 


(1999) gave an algorithm for learning spherical Gaussian mixtures when the means are well sepa¬ 


rated. Subsequent works (Dasgupta and Schulman (2000); Sanjeev and Kannan ( 2001| ); Vempala 
and Wang (2004); Brubaker and Vempala (2008)) developed better algorithms in the well-separated 
case, relaxing the spherical assumption and the amount of separation required. 


When the means of the Gaussians are not separated, after several works (Belkin and Sinha 
(2009); Kalai et al. (2010)), Belkin and Sinha (2010) and Moitra and Valiant ( 2010| ) independently 
gave algorithms that run in polynomial time and with polynomial number of samples for a fixed 
number of Gaussians. However, both running time and sample complexity depend super expo¬ 
nentially on the number of components Their algorithm is based on the method of moments 
introduced by Pearson (1894): first estimate the 0(h )-order moments of the distribution, then try 


to find the parameters that agree with these moments. Moitra and Valiant (2010) also show that 


the exponential dependency of the sample complexity on the number of components is necessary, 
by constructing an example of two mixtures of Gaussians with very different parameters, yet with 
exponentially small statistical distance. 


Recently, Hsu and Kakade (2013) applied spectral methods to learning mixture of spherical 
Gaussians. When n > k + 1 and the means of the Gaussians are linearly independent, their 
algorithm can learn the model in polynomial time and with polynomial number of samples. This 


result suggests that the lower bound example in Moitra and Valiant (2010) is only a degenerate 


case in high dimensional space. In fact, most (in general position) mixture of spherical Gaussians 
are easy to learn. This result is also based on the method of moments, and only uses second and 


third moments. Several follow-up works (Bhaskara et al. (2014); Anderson et al. (2013)) use higher 
order moments to get better dependencies on n and k. 


However, the algorithm in Hsu and Kakade (2013) as well as in the follow-ups all make strong 


requirements on the covariance matrices. In particular, most of them only apply to learning mixture 
of spherical Gaussians. For mixture of Gaussians with general covariance matrices, the best known 


result is still Belkin and Sinha (2010) and Moitra and Valiant (2010), which algorithms are not 


polynomial in the number of components k. This leads to the following natural question: 

Question: Is it possible to learn most mixture of Gaussians in polynomial time using a polynomial 
number of samples? 


Our Results In this paper, we give an algorithm that learns most mixture of Gaussians in high 
dimensional space (when n > H(I 2 )), and the argument is formalized under the smoothed analysis 


framework first proposed in Spielman and Teng (2004). 


This is different from the problem of density estimation considered in 


In fact, it is in the order of 0(e°^ ) as shown in Theorem 11.3 in Valiant (2012). 


Feldman et al. 


(2006); 


Chan et al. 


(2014) 
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In the smoothed analysis framework, the adversary first choose an arbitrary mixture of Gaus- 
sians. Then the mean vectors and covariance matrices of this Gaussian mixture are randomly 
perturbed by a small amount p 0 The samples are then generated from the Gaussian mixture 
model with the perturbed parameters. The goal of the algorithm is to learn the perturbed param¬ 
eters from the samples. 

The smoothed analysis framework is a natural bridge between worst-case and average-case 
analysis. On one hand, it is similar to worst-case analysis, as the adversary chooses the initial 
instance, and the perturbation allowed is small. On the other hand, even with small perturbation, 
we may hope that the instance be different enough from degenerate cases. A successful algorithm 
in the smoothed analysis setting suggests that the bad instances must be very “sparse” in the 
parameter space: they are highly unlikely in any small neighborhood of any instance. Recently, 


the smoothed analysis framework has also motivated several research work (Kalai et al. (2009) 


Bhaskara et al. (2014)) in analyzing learning algorithms. 


In the smoothed analysis setting, we show that it is easy to learn most Gaussian mixtures: 


Theorem 1.1. (informal statement of Theorem 3.f) In the smoothed analysis setting, when n > 


Gl(k 2 ), given samples from the perturbed n-dimensional Gaussian mixture model with k components, 
there is an algorithm that learns the correct parameters up to accuracy e with high probability, using 
polynomial time and number of samples. 

An important step in our algorithm is to learn Gaussian mixture models whose components 
all have mean zero, which is also a problem of independent interest (Zoran and Weiss. (2012)). 
Intuitively this is also a “hard” case, as there is no separation in the means. Yet algebraically, this 
case gives rise to a novel tensor decomposition algorithm. The ideas for solving this decomposition 
problem are then generalized to tackle the most general case. 


Theorem 1.2. (informal statement of Theorem 3.5) In the smoothed analysis setting, when n > 
Gl(k 2 ), given samples from the perturbed mixture of zero-mean n-dimensional Gaussian mixture 
model with k components, there is an algorithm that learns the parameters up to accuracy e with 
high probability, using polynomial running time and number of samples. 


Organization The main part of the paper will focus on learning mixtures of zero-mean Gaussians. 
The proposed algorithm for this special case contains most of the new ideas and techniques. In 
Section [2] we introduce the notations for matrices and tensors which are used to handle higher 
order moments throughout the discussion. Then in Section [3] we introduce the smoothed analysis 
model for learning mixture of Gaussians and discuss the moment structure of mixture of Gaussians, 
then we formally state our main theorems. Section [4] outlines our algorithm for learning zero-mean 
mixture of Gaussians. The details of the steps are presented in Section [5] The detailed proofs for 
the correctness and the robustness are deferred to Appendix (Sections [B] to 0- In Section [b] we 
briefly discuss how the ideas for zero-mean case can be generalized to learning mixture of nonzero 
Gaussians, for which the detailed algorithm and the proofs are deferred to Appendix |F} 


2 Notations 


Vectors and Matrices In the vector space M n , let (•, •) denote the inner product of two vectors, 
and II • || to denote the Euclidean norm. 


3 See Definition 


3.2 


in Section 


3.1 


for the details. 
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For a tall matrix A G R mxn , let Ay^ denote its j-th column vector, let A T denote its transpose, 
A* = {AJA)~ 1 A J denote the pseudoinverse, and let o^A) denote its fc-th singular value. Let I n 
be the identity matrix of dimension n x n. The spectral norm of a matrix is denoted as || • ||, and 
the Frobenius norm is denoted as || • || p. We use A >z 0 for positive semidefinite matrix A. 

In the discussion, we often need to convert between vectors and matrices. Let vec(A) G R mn 
denote the vector obtained by stacking all the columns of A. For a vector x G R m , let rnat(x') G 
M rn x m denote the inverse mapping such that vec(mat(x)) = x. 

We use [re] to denote the set {l,2,...,n} and [re] x [re] to denote the set {( i,j ) : i,j G [re]}. These 
are often used as indices of matrices. 

Symmetric matrices We use to denote the space of all re x re symmetric matrices, which 

subspace has dimension ( n ^ 1 ). Since we will frequently use re x re and kxk symmetric matrices, we 
denote their dimensions by the constants re 2 = (”^ 1 ) and = (^i^ 1 ). Similarly, we use R”^' xn to 
denote the symmetric /c-dimensional multi-arrays (tensors), which subspace has dimension ( n+ £ 1 )- 
If a k -th order tensor X G then for any permutation 7 r over [k], we have X m = 

VJ^TT(fc) * 

Linear subspaces We represent a linear subspace S G M n of dimension d by a matrix S G M nxd , 
whose columns of S form an (arbitrary) orthonormal basis of the subspace. The projection matrix 
onto the subspace S is denoted by Proj s = SS T , and the projection onto the orthogonal subspace 
S 1 - is denoted by Projgx = I n — SS T . When we talk about the span of several matrices, we mean 
the space spanned by their vectorization. 

Tensors A tensor is a multi-dimensional array. Tensor notations are useful for handling higher 
order moments. We use <8> to denote tensor product, suppose a, b, c G M n , T = a <S> b (g> c G R nxnxn 
and Tj li j 2; j 3 = a tt b V2 c l?j . For a vector x G R n , let the f-fold tensor product x<8>* denote the t -th order 
rank one tensor (x<g> f )ii,i 2 ,...,u = | [ x iy 

Every tensor defines a multilinear mapping. Consider a 3-rd order tensor X G M. nAXnBXnc . For 
given dimension rriA,mB,mc, it defines a multi-linear mapping X(-,-,-) : R nAXm/l x R n s xm s x 
W lcXm c x m B x m c defined as below: (Vji G [m^], j '2 € [m#],^ G [me]) 

(Vi , V2 , V3 )]jij 2 j 3 = Xi 1 } i 2 ) i 3 [Pl]ji,ii [P2]j2,*2 [^ 3 ]i 3 ,* 3 - 

*1 e [n A ],i2 6 [ns] ,*3 S [n c ] 

If X admits a decomposition X = Y2i=i ® By,i] ® Cy^ for A G R nAXk , B G M nsXfc , C G M ncXfc , 
the multi-linear mapping has the form X(Vi,V 2 , V 3 ) = X^=i(^i T ^[:,i]) ® (^2 ® (P 3 

In particular, the vector given by X(ei,ej, I) is the one-dimensional slice of the 3-way array, 
with the index for the first dimension to be i and the second dimension to be j. 
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Matrix Products We use © to denote column wise Katri-Rao product, and ©j, r to denote 
Kronecker product. As an example, for matrices A E M m ^ xn 5 B E R^x", C E M mcX ": 

n 

A®B®CZ R™AXm B xm c ^ [ A ® B ® C\ jlj2 j 3 = A hd B hd C . A© 

2—1 


AQB E 

i AQB k-J} =A l-J] B [;0V 




A hl B ■ 

A\^ n B 

A ©fc r B E R m Am B xn 2 

A B — 





AmA,l B 

A R 


3 Main results 

In this section, we first formally introduce the smoothed analysis framework for our problem and 
state our main theorems. Then we will discuss the structure of the moments of Gaussian mixtures, 
which is crucial for understanding our method of moments based algorithm. 


3.1 Smoothed Analysis for Learning Mixture of Gaussians 


Let Q n ,k denote the class of Gaussian mixtures with k components in M n . A distribution in this 
family is specified by the following parameters: the mixing weights uii, the mean vectors and 
the covariance matrices SW, for i €[*]. 


Qn, k := = 


e 




k 

E 

2—1 


0Ji = 1, E 




pxn 
sym 5 


E w © 0 


As an interesting special case of the general model, we also consider the mixture of “zero-mean” 
Gaussians, which has = 0 for all components i E [k]. 

A sample x from a mixture of Gaussians is generated in two steps: 


1. Sample h E [k] from a multinomial distribution, with probability Pr [h = i] = uii for i E [k]. 

2. Sample x E M n from the h-th Gaussian distribution E^). 


The learning problem asks to estimate the parameters of the underlying mixture of Gaussians: 


Definition 3.1 (Learning mixture of Gaussians). Given N samples X\,X 2 , ...,xn drawn i.i.d. from 
a mixture of Gaussians Q = {{oJi, S^)}j 6 rw, an algorithm learns the mixture of Gaussians with 
accuracy e, if it outputs an estimation Q = {(cDE^)}j e rw such that there exists a permutation 
it on [k], and for all i E [A:], we have |cD* — < e, \\jl W — \\ < e and ||EW — y;( 7r W)|| < e _ 


In the worst case, learning mixture of Gaussians is a information theoretically hard problem 
(Moitra and Valiant (2010)). There exists worst-case examples where the number of samples 


required for learning the instance is at least exponential in the number of components k (McLachlan 


and Peel (2004])). The non-convexity arises from the hidden variable h: without knowing h we 


cannot determine which Gaussian component each sample comes from. 

The smoothed analysis framework provides a way to circumvent the worst case instances, yet 
still studying this problem in its most general form. The basic idea is that, with high probability 
over the small random perturbation to any instance, the instance will not be a “worst-case” instance, 
and actually has reasonably good condition for the algorithm. 

Next, we show how the parameters of the mixture of Gaussians are perturbed in our setup. 
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Definition 3.2 (/3-smooth mixture of Gaussian). For p < 1/n, a p-smooth n-dimensional k- 
component mixture of Gaussians Q = {(£;», f^ l \ ^^)}ie[fc] ^ Qn,k is generated as follows: 

1. Choose an arbitrary (could be adversarial) instance Q = {(uij, p) l \ SW)} ie rw G Q n .k- Scale the 
distribution such that 0 ^ EM -< I/ n and ||//W|| < \ for all i G [A;]. 

2. Let A i G 6e a random symmetric matrix with zeros on the diagonals, and the upper- 

triangular entries are independent random Gaussian variables J\f(0,p 2 ). Let 8i G M n be a 
random Gaussian vector with independent Gaussian variables Af(0, p 2 ). 

3. Set Ui = Ui, W = /xW + <5,, E« = EW + A*. 

4- Choose the diagonal entries o/S^ arbitrarily, while ensuring the positive semi-definiteness of 
the covariance matrix E®, and the diagonal entries are upper bounded by 1. The perturbation 
procedure fails if this step is infeasibl ^ } 

A p-smooth zero-mean mixture of Gaussians is generated using the same procedure, except that we 
set jj,^ = = 0, for all i G [k]. 

Remark 3.3. When the original matrix is of low rank, a simple random perturbation may not lead 
to a positive semidefinite matrix, which is why our procedure of perturbation is more restricted in 
order to guarantee that the perturbed matrix is still a valid covariance matrix. 

There could be other ways of locally perturbing the covariance matrix. Our procedure actually 
gives more power to the adversary as it can change the diagonals after observing the perturbations 
for other entries. Note that with high probability if we just let the new diagonal to be 5y/np larger 
than the original ones, the resulting matrix is still a valid covariance matrix. In other words, the 
adversary can always keep the perturbation small if it wants to. 

Instead of the worst-case problem in Definition |3.1[ our algorithms work on the smoothed 
instance. Here the model first gets perturbed to Q = {(a 5 j,//W, E^)} the samples are drawn 
according to the perturbed model, and the algorithm tries to learn the perturbed parameters. We 
give a polynomial time algorithm in this case: 

Theorem 3.4 (Main theorem). Consider a p-smooth mixture of Gaussians Q = {(uij, £c^)}ie[fc] G 
Qn.k f or which the number of components is at ZeastJ^A; > Co and the dimension n > C\k 2 , for some 
fixed constants Co and C\. Suppose that the mixing weights up > uv 0 for all i G [A;]. Given N samples 
drawn i.i.d. from Q, there is an algorithm that learns the parameters of Q up to accuracy e, with 
high probability over the randomness in both the perturbation and the samples. Furthermore, the 
running time and number of samples N required are both upper bounded by poly(n, k, l/co 0 , 1/e, 1 / p). 


To better illustrate the algorithmic ideas for the general case, we first present an algorithm for 
learning mixtures of zero-mean Gaussians. Note that this is not just a special case of the general 
case, as with the smoothed analysis, the zero mean vectors are not perturbed. 


Theorem 3.5 (Zero-mean). Consider a p-smooth mixture of zero-mean Gaussians Q = {(w ? ;, 0, Z/d)},; e r/.i G 
Qn.k for which the number of components is at least k > Co and the dimension n> C\k 2 , for some 
fixed constants Cq and C\. Suppose that the mixing weights w,; > co 0 for all i G [A;]. Given N 


Note that by stan dard random matrix theory, with high probability the 4-th step is feasible and the perturbation 


procedure in Definition 3.2 succeeds. Also, with high probability we have H/h^H < 1 and 0 S r< In for all i £ [k]. 


5 Note that the algorithms of 
fixed k. 


Belkin and Sinha 


2010 j) and 


Moitra and Valiant 


12010) run in polynomial time for 
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samples drawn i.i.d. from Q, there is an algorithm that learns the parameters of Q up to accuracy e, 
with high probability over the randomness in both the perturbation and the samples. Furthermore, 
the running time and number of samples N are both upper bounded by poly(n , k, l/u 0 , 1/e, 1/ p). 

Throughout the paper we always assume that n > C\k 2 and u>i > uj 0 . 


3.2 Moment Structure of Mixture of Gaussians 


Our algorithm is also based on the method of moments, and we only need to estimate the 3-rd, 
the 4-th and the 6-th order moments. In this part we briefly discuss the structure of 4-th and 6-th 
moments in the zero-mean case (3-rd moment is always 0 in the zero-mean case). These structures 
are essential to the proposed algorithm. For more details, and discussions on the general case see 
Appendix [A| 

The 777 .-th order moments of the zero-mean Gaussian mixture model Q E Q n ,k are given by the 
following m-th order symmetric tensor M m E 

k 

:= K [Th ■ • • x j m ] = ^lE 

i =1 

where y® corresponds to the n-dimensional zero-mean Gaussian distribution AA(0, £®). The mo¬ 
ments for each Gaussian component are characterized by Isserlis’s theorem as below: 

Theorem 3.6 (Isserlis’ Theorem). Let (y 4 ,..., y 2 t) be a multivariate zero-mean Gaussian random 
vector jV(0, X), then 

%i • • ■ U2t ]= 

where the summation is taken over all distinct ways of partitioning y\,..., ipit into t pairs, which 
correspond to all the perfect matchings in a complete graph. 

Ideally, we would like to obtain the following quantities (recall 112 = ( n ^ 1 )): 


(*) 

4 


00 

y) 

Jm 


Vji, • • • ,j m e M, 


X 4 = cjivec 
i=1 


(£ W )<g > 2 E IT 2 


xn 2 


Xq = ujjvec(T 1 1 ' 

i=l 


E l " 2 


xn 2 xn 2 


( 1 ) 


Note that the entries in X 4 and Xq are quadratic and cubic monomials of the covariance 
matri ces, re spectively. If we have X 4 and Xq, the tensor decomposition algorithm in Anandkumar 
(2014) can be immediately applied to recover wf s and £®’s under mild conditions. It is easy 


et al. 


to verify that those conditions are indeed satisfied with high probability in the smoothed analysis 
setting. 

By Isserlis’s theorem, the entries of the moments M4 and Mq are indeed quadratic and cubic 
functions of the covariance matrices, respectively. However, the structure of the true moments M 4 
and Mq have more symmetries, consider for example, 


[M 4 ] 


1 , 2 , 3,4 = 


y>(sS* 


(®) y(') 1 y(®) y(0 

2^3,4 “V Zj 1,3 Zj 2,4 


1 y W yM 'i 
+ ^1,4^2,31 


i= 1 


k 

while [X 4 ](i i 2 ),( 3 , 4 ) = 

i= 1 


Note that due to symmetry, the number of distinct entries in M 4 ( ( n 4 3 ) ~ n 4 /24) is much smaller 
than the number of distinct entries in X 4 (( n2 2 1 " 1 ) ~ 77 4 / 8 ). Similar observation can be made about 
Mq and Xq. 
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Therefore, it is not immediate how to find the desired X 4 and Xq based on M 4 and Mg. We 
call the moments M 4 , Mg the folded moments as they have more symmetry, and the corresponding 
X 4 , Xq the unfolded moments. One of the key steps in our algorithm is to unfold the true moments 
M 4 , Mg to get X 4 , Xq by exploiting special structure of M 4 , Mg. 

In some cases, it is easier to restrict our attention to the entries in M 4 with indices corresponding 
to distinct variables. In particular, we define 


AI 4 = [[M 4 ] ilj2ii3J 4 : 1 < j\ < j 2 < J 3 < h < n\ e ^ n4 , 


(2) 


where n 4 = (2) is the number of 4-tuples with indices corresponding to distinct variables. We 
define Mg £ M ne similarly where ne = Q). We will see that these entries are nice as they are 
linear projections of the desired unfolded moments X 4 and Xq (Lemma 3.7 below), also such 
projections satisfy certain “symmetric off-diagonal” properties which are convenient for the proof 
(see Definition C.3 in Section [C]). 


Lemma 3.7. For a zero-mean Gaussian mixture model, there exist two fixed and known linear 
mappings X 4 : M n2Xn2 —y M™ 4 and Xq : M™ 2 xn 2 xn 2 jpe suc h that: 


M 4 = VsX^Xa), Me = V\5Te{Xe). 


(3) 


Moreover X 4 is a projection from a -dimensional subspace to a 714 -dimensional subspace, and 

Xe is a projection from a (J 12 ^ 2 ) -dimensional subspace to a ne-dimensional subspace. 


4 Algorithm Outline for Learning Mixture of Zero-Mean Gaus- 
sians 

In this section, we present our algorithm for learning zero-mean Gaussian mixture model. The 
algorithmic ideas and the analysis are at the core of this paper. Later we show that it is relatively 
easy to generalize the basic ideas and the techniques to handle the general case. 

For simplicity we state our algorithm using the exact moments M 4 and Mg, while in implemen¬ 
tation the empirical moments M 4 and Mg obtained with the samples are used. In later sections, 
we verify the correctness of the algorithm and show that it is robust: the algorithm learns the 
parameters up to arbitrary accuracy using polynomial number of samples. 

Step 1. Span Finding: Find the span of covariance matrices . 

(a) For a set of indices FI C [n] of size \FL\ = s/n, find the span: 

S = span jsgj : i € [As], j G Fi} C R n . (4) 

(b) Find the span of the covariance matrices with the columns projected onto S 2 -, namely, 

Us = span^vec(Proj s ±T,^) : i € [&] j C M" 2 . (5) 

(c) For two disjoint sets of indices FL\ and FL 2 , repeat Step 1 (a) and Step 1 (b) to obtainU\ and 
U 2 , namely the span of covariance matrices projected onto two subspaces S ^ and S^- Merge 
U\ and U 2 to obtain the span of covariance matrices U: 

Z4 = span{s w : i G [A:]} C M na . (6) 
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Step 2. Unfolding: Recover the unfolded moments X 4 ,Xq. 

Given the folded moments M 4 , Mq as defined in Q, and given the subspace U G M n2Xfc from Step 
1 , let Y 4 G Mj x ^ and Yq G be the unknowns, solve the following systems of linear equations. 

M 4 = V3X 4 (UY 4 U T ), Mq = V15X 6 (Yq{U t , U t , U t )). (7) 

The unfolded moments X 4 ,Xq are then given by X 4 = UY 4 U t , Xq = Yq(U t ,U T ,U T ). 

Step 3. Tensor Decomposition: learn Wi and E® from Y 4 and Yq. 

Given U, and given I 4 and Yq which are relate to the parameters as follows: 

k k 

Y 4 = Yq = ^W;(t/ T E«)<g> 3 , 

Z=1 Z=1 

we apply tensor decomposition techniques to recover EW ’ s and wy’s. 


5 Implementing the Steps for Mixture of Zero-Mean Gaussians 


In this part we show how to accomplish each step of the algorithm outlined in Section [4] and sketch 
the proof ideas. 

For each step, we first explain the detailed algorithm, and list the deterministic conditions on 
the underlying parameters as well as on the exact moments for the step to work correctly. Then we 
show that these deterministic conditions are satisfied with high probability over the p-perturbation 
of the parameters in the smoothed analysis setting. In order to analyze the sample complexity, we 
further show that when we are given the empirical moments which are close to the exact moments, 
the output of the step is also close to that in the exact case. 

In particular we show the correctness and the stability of each step in the algorithm with two 
main lemmas: the first lemma shows that with high probability over the random perturbation of 
the covariance matrices, the exact moments satisfy the deterministic conditions that ensure the 
correctness of each step; the second lemma shows that when the algorithm for each step works 
correctly, it is actually stable even when the moments are estimated from finite samples and have 
only inverse polynomial accuracy to the exact moments. 

The detailed proofs are deferred to Section [B] to [D] in the appendix. 


Step 1: Span Finding. Given the 4-th order moments M 4 , Step 1 finds the span of covariance 
matrices U as defined in <©• Note that by definition of the unfolded moments X 4 in Q, the 
subspace U coi ncid es with the column span of the matrix X 4 . 


By Lemma 3.7, we know that the entries in M 4 are linear mappings of entries in X 4 . Since the 


matrix X 4 is of low rank (k <C 712 ), this corresponds to the matrix sensing problem first studied 


m 


Recht et al. (2010). In general, matrix sensing problems can be hard even when we have many 


linear observations (Hardt et al. (2014b)). Previous works (Recht et al. (2010); Hardt et al. (2014a); 


Jain et al. (2013)) showed that if the linear mapping satisfy matrix RIP property, one can uniquely 


recover X 4 from M 4 . 

However, properties like RIP do not hold in our setting where the linear mapping is determined 
by Isserlis’ Theorem. We can construct two different mixtures of G aussians with different unfolded 
moments X 4 , but the same folded moment M 4 (see Section A.3). Therefore the existing matrix 
recovery algorithm cannot be applied, and we need to develop new tools by exploiting the special 
moment structure of Gaussian mixtures. 






























Step 1 (a). Find the Span of a Subset of Columns of the Covariance Matrices. The 

key observation for this step is that if we hit M 4 with three basis vectors, we get a vector that lies 
in the span of the columns of the covariance matrices: 


Claim 5.1. For a mixture of zero-mean Gaussians Q 
slices of the f-th order moments M 4 are given by: 

k 


M±(ej 


„i) = 


E 


Ui 


E W ■ S®. , + E (i) ■ 

31,32 [:j 3 ] 31,33 


i =1 


{( W »i 0 , &Gn,k, the one-dimensional 

S M + S £i, S £,])> VA.*.*6W- (8) 


In particular, if we pick the indices j 1 , J 2 ,. 7,3 hr the index set H, the vector M 4 (ej 1 ,ej 2 ,ej 3 ,1) 
lies in the desired span S = |e|*^ : i E [k],j E 4/ j. 

We shall partition the set % into three disjoint subsets 'w- 1 * of equal size y/n/ 3, and pick j t E H^> 
for i = 1,2,3. In this way, we have (|%|/3 ) 3 = fi(n 1-5 ) such one-dimensional slices of M 4 , which all 
lie in the desired subspace S. Moreover, the dimension of the subspace S is at most k\H\ <C n 1 ' 5 . 
Therefore, with the p-perturbed parameters sW’s, we can expect that with high probability the 
slices of M 4 span the entire subspace S. 

Condition 5.2 (Deterministic condition for Step 1 (a)). Let Qs E M nx (l^l / 3 ) 3 be the matrix whose 
columns are the vectors , e j2 , e j3 , I) for ji E 7-^. If the matrix Qs achieves its maximal 

column rank k\H\, we can find the desired span S defined in 0 by the column span of matrix Qs- 


We first show that this deterministic condition is satisfied with high probability by bounding 
the k\H\-t\i singular value of Qs with smoothed analysis. 

Lemma 5.3 (Correctness). Given the exact f-th order moments M 4 , for any index set H of size 
\H\ = y/n, With high probability, the k\H\-th singular value of Qs is at least kl{ui 0 p 2 n). 


The proof idea involves writing the matrix Qs as a product of three matrices, and using the 


results on spectral properties of random matrices Rudelson and Vershynin (2009) to show that with 


high probability the smallest singular value of each factor is lower bounded. 

Since this step only involves the singular value decomposition of the matrix Qs, we then use 
the standard matrix perturbation theory to show that this step is stable: 

Lemma 5.4 (Stability). Given the empirical estimator of the f-th order moments M 4 = M 4 + E 4 , 
suppose that the entries of E 4 have absolute value at most 5. Let the columns of matrix S E 
be the left singular vector of Qs, and let S be the corresponding matrix obtained with M 4 . When 
5 is inverse polynomially small, the distance between the two projections \\Proj§ — Projg || is upper 

bounded by O (n 1,25 


Remark 5.5. Note that we need the high dimension assumption (n k) to guarantee the cor¬ 
rectness of this step: in order to span the subspace S, the number of distinct vectors should be 
equal or larger than the dimension of the subspace, namely |%| 3 > k\H\; and the subspace should be 
non-trivial, namely k\H\ < n. These two inequalities suggest that we need n > f^/c 1 " 5 ). However, 
we used the stronger assumption n > Lt(k 2 ) to obtain the lower bound of the smallest singular value 
in the proof. 
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Step 1 (b). Find the Span of Projected Covariance Matrices. In this step, we continue 
to use the structural properties of the 4-th order moments. In particular, we look at the two- 
dimensional slices of M 4 obtained by hitting it with two basis vectors: 


Claim 5.6. For a mixture of zero-mean Gaussians Q = 
slices of the 4-th order moments M 4 are given by: 

k 


M 4 (e 


1 e j2 




OJi 


y W y(9 1 y 


40 




i= 1 


{(u;£, 0 , SW)} ie [ fc ] E G n ,k> the two-dimensional 
) T + sg 2 ] (E« i] ) T ), Vj 1 ,j 2 e[n}. (9) 


Note that if we take the indices j\ and j 2 in the index set PL, the slice M 4 (ej 1 , ej 2 ,1,1) is 
almost in the span of the covariance matrices, except 2 k additive rank-one terms in the form of 

y (®) 

fbl] 

obtained in Step 1 (a), namely, 

k 

vec(Proj 5 ±M 4 (e il ,e j2 ,/,/)) = ^WiS^ i ) j2 vec(Proj 5 xE w ), Vji,j 2 G PL, 

i =1 


(sf*^. ,) T . These rank-one terms can be eliminated by projecting the slice to the subspace 5^ 


and this projected two-dimensional slice lies in the desired span Us as defined in ([ 5 ]). Moreover, 
there are (^i +1 ) = fl(n) such projected two-dimensional slices, while the dimension of the desired 
span Us is at most k. 

Condition 5.7 (Deterministic condition for Step 1 (b)). Let Qu s G M n2X l^l(l^l+ 1 )/ 2 be a matrix 
whose (ji,j 2 )-th column for is equal to the projected two-dimensional slice vec{Proj s ± M 4 (e^,, e j2 .1,1)), 
for j 1 < j 2 and ji,j 2 E PL. If the matrix Qu s achieves its maximal column rank k, the desired span 
Us defined in © is given by the column span of the matrix Qu s ■ 


We show that this deterministic condition is satisfied by bounding the /c-th singular value of 
Qu s in the smoothed analysis setting: 

Lemma 5.8 (Correctness). Given the exact 4-th order moments M 4 , with high probability, the k-th 
singular value of Qjj s is at least Ll(oj 0 p 2 n 1 ' 5 ). 


Similar to Lemma 5.3 the proof is based on writing the matrix Qu s as a product of three 


matrices, then bound their k-th singular values using random matrix theory. The stability analysis 
also relies on the matrix perturbation theory. 

Lemma 5.9 (Stability). Given the empirical 4-th order moments M 4 = M 4 + E 4 , assume that the 
absolute value of entries of E 4 are at most d 2 . Also, given the output Projg ± from Step 1 (a), and 
assume that \\Proj§ ± — Projg ± \\ < S\. When <5i and 5 2 are inverse polynomially small, we have 

, 2.5 


II P r oju s ~ Pr °ju s W - 0 \ n (^2 + 2<5i) /a k {Qu ; 


Step 1 (c). Merge U\, U 2 to get the span of covariance matrices U. Note that for a given 
index set FI, the span Us obtained in Step 1 (b) only gives partial information about the span of 
the covariance matrices. The idea of getting the span of the full covariance matrices is to obtain 
two sets of such partial information and then merge them. 

In order to achieve that, we repeat Step 1 (a) and Step 1 (b) for two disjoint sets FL\ and Pi 2 , 
each of size y/n. The two subspace S\ and S 2 thus correspond to the span of two disjoint sets of 
covariance matrix columns. Therefore, we can hope that U\ and U 2 , the span of covariance matrices 
projected to S f and 5^ contain enough information to recover the full span U. 

In particular, we prove the following claim: 
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Condition 5.10 (Deterministic condition for Step 1 (c)). Let the columns of two (unknown) ma¬ 
trices V\ E M. nxk and Vi E M. nxk form two basis of the same k-dimensional (unknown) subspace 
U C M n , and let U denote an arbitrary orthonormal basis ofU. Given two s- dimensional subspaces 
S i and Si, denote S 3 = S(~ U S 2 . Given two projections ofU onto the two subspaces Sj and Sj: 
U\ = Proj s ±Vi and Ui = Proj s ±Vi. If ai s ({S\ , Si]) > 0 and ak(Proj S;j U) > 0, there is an algorithm 
for finding U robustly. 


The main idea in the proof is that since s is not too large, the two subspaces S(~ and S)~ have a 
large intersection. Using this intersection we can “align” the two basis V\ and V 2 and obtain V^Vi, 
and then it is easy to merge the two projections of the same matrix (instead of a subspace). 

Moreover, we show that when applying this result to the projected span of covariance ma¬ 
trices, we have s = k\H\ < n/3, and the two deterministic conditions o^sQSi, Si]) > 0 and 
a*,(Projs 3 Ui) > 0 are indeed satisfied with high probability over the parameter perturbation. The 
detailed smoothed analysis (Lemma B.13 and B.14) and the stability analysis (Lemma B.ll) are 
provided in Section |B.3| in the appendix. 


Step 2. Unfold the moments to get X4 and Xq. We show that given the span of covariance 
matrices U obtained from Step 1 , finding the unfolded moments X4, Xq is reduced to solving two 
systems of linear equations. 

Recall that the challenge of recovering X4 and Xq is that the two linear mappings J-4 and J-q 
defined in ([ 3 ]) are not linearly invertible. The key idea of this step is to make use of the span U to 
reduce the number of variables. Note that given the basis U E M. n2Xk of the span of the covariance 
matrices, we can represent each vectorized covariance matrix as IW = Ua W. Now Let Y4 E 
and I4 E M. kx Jf xk denote the unfolded moments in this new coordinate system: 

k k 

Y 4 := Y () = (g ) 3 . 

i =1 i=1 

Note that once we know Y4 and Yq, the unfolded moments X4 and Xq are given by X4 = UY4U T 
and Xq = Yq(U t , U T , U T ). Therefore, after changing the variable, we need to solve the two linear 
equation systems given in 0 with the variables Y4 and Yq. 

This change of variable significantly reduces the number of unknown variables. Note that the 
number of distinct entries in Y4 and Yq are ki = ( A: ^ 1 ) and £3 = ( fc g 2 ), respectively. Since ki < n4 

and k3 < uq, we can expect that the linear mapping from Y4 to M 4 and the one from Yq to Mq are 
linearly invertible. This argument is formalized below. 

Condition 5.11 (Deterministic condition for Step 2 ). Rewrite the two systems of linear equations 
in ([7]) in their canonical form and let H4 E M n4Xfc2 and Hq E M neXfc3 denote the coefficient matrices. 
We can obtain the unfolded moments X4 and Xq if the coefficient matrices have full column rank. 

We show with smoothed analysis that the smallest singular value of the two coefficient matrices 
are lower bounded with high probability: 

Lemma 5.12 (Correctness). With high probability over the parameter random perturbation, the 
ki-th singular value of the coefficient matrix H4 is at least Ll(p 2 n/k), and the singular value 
of the coefficient matrix Hq is at least D(/j 3 (n//c) 1 ' 5 ). 

To prove this lemma we rewrite the coefficient matrix as product of two matrices and bound 
their smallest singular values separately. One of the two matrices corresponds to a projection of 
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Figure 1: Flow of the algorithm for learning mixture of zero-mean Gaussians. 


the Kronecker product E. In the smoothed analysis setting, this matrix is not necessarily 

incoherent. In order to provide a lower bound to its smallest singular value, we further apply a 
carefully designed projection to it, and then we use the concentration bounds for Gaussian chaoses 
to show that after the projection its columns are incoherent, finally we apply Gershgorin’s Theorem 
to bound the smallest singular value [^] 

When implementing this step with the empirical moments, we solve two least squares problems 
instead of solving the system of linear equations. Again using results in matrix perturbation theory 
and using the lower bound of the smallest singular values of the two coefficient matrices, we show 
the stability of the solution to the least squares problems: 

Lemma 5.13 (Stability). Given the empirical moments M 4 = M4 + E4, Mq = Mq + Eq, and 
suppose that the absolute value of entries of E\ and Eq are at most 5 \. Let U, the output of Step 1 , 
be the estimation for the span of the covariance matrices, and suppose that \\JJ — U\\ < 62 - Let Y 4 
and Yq be the least squares solution respectively. When 5i and 62 are inverse polynomially small, 
we have \\Y 4 - Y 4 \\ F < 0(y/ru(5i + d 2 /<T min {H 4 )) and ||Y 6 - Yq\\ f < 0(y/n^(6i + 5 2 / a m in{Ha))- 


Step 3. Tensor Decomposition. 

Claim 5.14. Given Y4, Yq and U, the symmetric tensor decomposition algorithm can correctly and 
robustly find the mixing weights u>i’s and the vectors eq ’ s, up to some unknown permutation over 
[k], with high probability over both the randomized algorithm and the parameter perturbation. 

The algorithm and its analysis mostly follow the algorithm of symmetric tensor decomposition 

, and the details are provided in Section [D] in the appendix. 


m 


Anandkumar et al. (2014 


Proof Sketch for the Main Theorem of Zero-mean Case. Theorem 13.51 follows from the 
previous smoothed analysis and stability analysis lemmas for each step. 

First, exploiting the randomness of parameter perturbation, the smoothed analysis lemmas 
show that the deterministic conditions, which guarantee the correctness of each step, are satisfied 
with high probability. Then using concentration bounds of Gaussian variables, we show that with 
high probability over the random samples, the empirical moments M4 and Mq are entrywise 5-close 
to the exact moments M4 and Mq. In order to achieve e accuracy in the parameter estimation, we 
choose 5 to be inverse polynomially small, and therefore the number of samples required will be 
polynomial in the relevant parameters. The stability lemmas show how the errors propagate only 
“polynomially” through the steps of the algorithm, which is visualized in Figure [lj 

A more detailed illustration is provided in Section [E] in the appendix. 


6 Note that the idea of unfolding using system of linear equations also appeared in the work of Jain and Oh (20141. 
However, in order to show the system of linear equations in their setup is robust, i.e., the coefficient matrix has full 
rank, they heavily rely on the incoherence assumption, which we do not impose in the smoothed analysis setting. 


Jain and Oh 


(2014 
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Figure 2: Flow of the algorithm for learning mixtures of general Gaussians. 


6 Algorithm Outline for Learning Mixture of General Gaussians 

In this section, we briefly discuss the algorithm for learning mixture of general Gaussians. Figure [2] 
shows the inputs and outputs of each step in this algorithm. Many steps share similar ideas to 
those of the algorithm for the zero-mean case in previous sections. We only highlight the basic 
ideas and defer the details to Section [F] in the appendix. 

Step 1. Find Z = span{Jl W : i € [A;]} and S D = span{ Proj^ ± EWProj^ x : i € [&:]}. Similar 
to Step 1 in the zero-mean case, this step makes use of the structure of the 4-th order moments 
M 4 , and is achieved in three small steps: 

(a) For a subset % C [n] of size \H\ = \fn , find the span: 

S = span j//W, : i G [k\,j 6 PA j C M n . (10) 

(b) Find the span of the covariance matrices with the columns projected onto 5 -1 -, namely, 

Us = span |vec(Proj 5 ±X^) : i G [fc]| C M n . (11) 

(c) For disjoint subsets T~L\ and PA 2 , repeat Step 1 (a) and Step 1 (b) to obtain U\ and P/ 2 , the 

span of the covariance matrices projected onto the subspaces 5^ and S^ ■ The intersection 
of the two subspaces U\ and U2 gives the span of the mean vectors Z = span{/rM,i € [fc]}. 
Merge the two subspaces U\ and U2 to obtain the span of the covariance matrices projected 
to the subspace orthogonal to Z , namely £ 0 = span jProj ^E^Proj : i G [fc]|. 

Step 2. Find the Covariance Matrices in the Subspace Z 1 - and the Mixing Weights 

uii’s. The key observation of this step is that when the samples are projected to the subspace 
orthogonal to all the mean vectors, they are equivalent to samples from a mixture of zero-mean 
Gaussians with covariance matrices = Proj^X^Proj^x and with the same mixing weights 
u)iS. Therefore, projecting the samples to 2^ , the subspace orthogonal to the mean vectors, and 

~(i) 

use the algorithm for the zero-mean case, we can obtain So ’s, the covariance matrices projected 
to this subspace, as well as the mixing weights oij’s. 

Step 3. Find the means With simple algebra, this step extracts the projected covariance 
matrices S^’s from the 3-rd order moments M 3 , the mixing weights uii and the projected covariance 
matrices S^’s obtained in Step 2 . 
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~(i) 

Step 4. Find the full covariance matrices In Step 2, we obtained So , the covariance 
matrices projected to the subspace orthogonal to all the means. Note that they are equal to 
matrices (E^ + projected to the same subspace. We claim that if we can find the 

span of these matrices ((S^ + we can get each matrix (XW +/rW(/fW) T ), and then 

subtracting the known rank-one component to find the covariance matrix E®. This is similar to 
the idea of merging two projections of the same subspace in Step 1 (c) for the zero-mean case. 

The idea of finding the desired span is to construct a 4-th order tensor: 

k 

Mj[ = M4 + 2^w i (£«® 4 ), 

i=1 

which corresponds to the 4-th order moments of a mixture of zero-mean Gaussians with covariance 
matrices SW an d the same mixing weights 5j’s. Then we can then use Step 1 of the 

algorithm for the zero-mean case to obtain the span of the new covariance matrices, i.e. span{Ti ® + 

/x«(/ZW) T :*€ [Jfc]}. 

7 Conclusion 

In this paper we give the first efficient algorithm for learning mixture of general Gaussians in the 
smoothed analysis setting. In the algorithm we developed new ways of extracting information from 
lower-order moment structure. This suggests that although the method of moments often involves 
solving systems of polynomial equations that are intractable in general, for natural models there is 
still hope of utilizing their special structure to obtain algebraic solution. 

Smoothed analysis is a very useful way of avoiding degenerate examples in analyzing algorithms. 
In the analysis, we proved several new results for bounding the smallest singular values of structured 
random matrices. We believe the lemmas and techniques can be useful in more general settings. 

Our algorithm uses only up to 6-tlr order moments. We conjecture that using higher order 
moments can reduce the number of dimension required to n > 0(/c 1+e ), or maybe even n > Q(k e ). 
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A Moment Structures 


In this section we characterize the structure of the 3-rd, 4-th and 6 -th moments of Gaussians 
mixtures. 


As described in Section 3.2, the m-th order moments of the Gaussian mixture model are given 
by the following m-th order symmetric tensor M e 


® [ x ji ■ ■ ■ x jm] ~ UjE 


i =1 


(0 

4 




Vji, • •. ,j m e N, 


where y W corresponds to the n-dimensional Gaussian distribution _A/"(/i®, I]W). 

Gaussian distribution is a highly symmetric distribution, and in the zero-mean case the higher 
moments are well-understood by Isserlis’ Theorem: 

Theorem A.l (Isserlis). Let y = {y \,..., y 2 t) be a multivariate Gaussian random vector with mean 
zero and covariance £, then 


%i • - - 2/2*] = 

E[yi. - - 2/ 2 *-i] = 0, 

where the summation is taken over all distinct ways of partitioning yi,... ,y 2 t Vrfo t pairs, which 
correspond to all the perfect matchings in a complete graph. Thus there are (2 1 — 1)!! terms in the 
sum, and each summand is a product of t terms. 

The non-zero mean case is a direct corollary using Isserlis’ Theorem and linearity of expectation. 

Corollary A.2. Let y = (yi,..., yt) be a multivariate Gaussian random vector with mean p and 
covariance £, then 

E[yi ... yt] = 'f ' | Tj U jV [ | p, w . 

where the summation is taken over all distinct ways of partitioning y\,... ,yt into p pairs of (u,v) 
and s singletons of (w), where p > 0, s > 0 and 2 p + s = t. 

As an example, E[yiy 2 y 3 ] = P 1 P 2 P 3 + /UiS 2 ,3 + ^ 2 ^ 1,3 + p 3 ^i, 2 - 


A.l Proof of Lemma 13.71 


We shall first prove Lemma 3.7 in Section 3.2 Recall that this lemma shows that for mixture of 
zero-mean Gaussians, the 4-th moments M 4 and the 6 -th moments Mq with distinct indices can 
be viewed as a linear projection of the unfolded moment X 4 and Xq defined in ([Tj). 


Proof, (of Lemma 3.7) 


By Isserlis Theorem A.l, the mapping \Z 3 J -4 is characterized by: (VI < j\ < j 2 < J 3 < ji < n) 


m = E"<(V‘.LV’,L+ e 


■>(*) y(®) _i_ y(*) y(*) > 

^31,33 32,3A _r jl,H 32,j3> 


i= 1 


[-^"4] (j! ,j 2 ),(j3 ,3 a) A 4 ](ilJ 3 ),(j2j4) + A 4 ](il J4),(j2 J 3 ) ' 
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Therefore, with the normalization constant \/3, the (ji, J 2 , J 3 , j 4 )-th mapping of T\ is a projection 
of the three elements in X 4 . Similarly, we have for \/l5Fo: (VI < ji < j '2 < • • • < j6 < n) 

[M 6 }n ,32 , 3334 , 35 ,36 

= [^6](j! J 2 ),(i 3 J 4 ),(i5 J 6 ) ,j 3 ,(j 2 J4),0'5 J6) + [^ 6 ](il J4),0'2 ,i3),0'5 J6) + [^] (ji J s ),(j 2 J 3 ),(j4 J6) 

+ [-^Q)] (jl ,J2),(J5 J3),(j4,i6) [^6](jiJ 3 ),(j2,i5),0'4,J6) [^ 6 1 (jl >J2),(j4 ,35),{j3,36) [^ 6 1 O'l,J4),(j2,J5),0 _ 3,J6) 

+ [^6](ji,J 6 ), 0 _ 2 ,J4 ), 0 _ 3, J6 ) + ["^s] (j 1 , J 3 ), (j 4 J 5 ), (j 2 J6 ) + t^] (jx J 4 ),(j 3 Jb),(32,3b) + t^] (ji ,j 5 ),(j 3 ,j 2 ),(j2 ,je) 

+ [^6 ](j 2 J 3 ),(j4J 5 ),(ji,j 6 ) + [^6 ] (i2 ,V4 ), (V 3 ,J5 ), OT ,36) + [^ 6 ] (V2, V5 ), 0' 3 ,J4 ), (Vl ,J6 ) ‘ 

Thus with the normalization constant \/l5, the mapping J-q is a linear projection. □ 


A.2 Slices of Moments 

Next we shall characterize the slices of the moments of mixture of Gaussians. 

For mixture of zero-mean Gaussians, a one-dimensional slice of the 4th moment tensor is a 
vector in the span of corresponding columns of the covariance matrices: 

Claim A.3 (Claim [5T] restated). For a mixture of zero-mean Gaussians, the one-dimensional slices 
of the 4-th moments M 4 are given by: 


M 4 (e 


311 21 V73 


> J ) = ( 




y(®) 

31,32 [:j 3 ] 


+ E (i) • E®. , + E W • sf i} . , 

31,33 [., 32 ] 32,33 [:, 3 1 ] 


Vjl, J2, J3 G N- 


2—1 


Proof. By the definition of multilinear map, M 4 (ejj, ej 2 , ej 3 , /) is a vector whose p-th entry is equal 
to M 4 (ej 15 ej 2 , ej 3 , e p ). We can compute this entry by Isserlis’ Theorem: 


M 4 (ej 


>) = E 


w,: 


i=l 


y4*) yj(®) 

ilJ2 [p,j 3 ] 


+ 


y4®) y(®) 
31,33 [p,j 2 ] 


+ E W • . , 

32,33 \p,3l\ 


this directly implies the claim. □ 

For mixture of zero-mean Gaussians, a two-dimensional slice of the 4th moment M 4 is a matrix, 
and it is a linear combination of the covariance matrices with some additive rank one matrices: 

Claim A.4 (Claim [543| restated). For a mixture of zero-mean Gaussians, the two-dimensional slices 
of the 4-th moment M 4 are given by: 


M^ej 




Ui 


i=1 


V®) 

V 1 ,32 


E( ‘ ) + E S,i< E S=i 


T + , 

[ ; J2] 


( E Si]) T )’ 


Proof. Again this follows from Isserlis’ theorem. By definition of multilinear map this is a matrix 
whose (p, g)-th entry is equal to 


M 4 (e 


311 °J 2 j °p> 


e 0 = 


E 

2=1 


^2 


yi(*) y(®) 

^'142^^91 


+ 


y(®) y(®) 


+ 


y(®) y(®) 
^32,P^[q,j 1 ] 


and this directly implies the claim. 

□ 
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Similarly, for mixture of general Gaussians, we prove the following claims: 


Claim A.5 (Claim F.l restated). For a mixture of general Gaussians, the (j\, j‘ 2 ,j:i)-th one¬ 
dimensional slice 0 /M 4 is given by: 


M±( ejl ,e h ,e j 3 ,I) = $2 $3 ^ + 


i= 1 


\ v(*) + uWiil’lE® + £(») u (i)n(i) 

/ y V^^TTl ,7T2 ^[:,7r 3 ] ' r 1 7T1 A* 7T2 ^[:,7T 3 ] ' ^TTl ,7T2 ”773 


'O'l J 2 J 3 ), 

Tre-j (j2J3,jl), 
. C?3 Jl J 2 ) 


Proof. This is very similar to Claim 5.1 and follows from the corollary of Isserlis’s theorem (Corol¬ 
lary A.2). There are 10 ways to partition the indices {ji, ji, J 3 ,j 4 } into pairs and singletons: 
((ji), (h), (j' 3 ), (j' 4 )), ((ji, 32), C73), O 4 )), (0i,j 3 ), O 2 ), (j4)), ((ji, j' 4 ), (j' 2 ), (j' 3 )), ((32,33), (ji), (ji)), 
((h,ji),(ji),(j 3 )), ((33,34), (31), (32)), ((ji,32), (j3) J 4 ))) ((ji,j 3 ), (J2,J4)), (0‘1,J4), (j 2 , j 3 ))- From 
this enumeration, we can specify each element in the vector of the one-dimensional slice. □ 


Claim A.6 (Claim F.4 restated). For a mixture of general Gaussians, let the matrix M 3(1 j E M nxr 


be the matricization of M 3 along the first dimension. The j-th row o/M 3 ( 1 ) is given by: 

k 

[ M 3(i)k:] = (4* } wec(S W ) + + h 


2—1 


■<M 3 i 


Proof. Note that [M 3 ndu.i = vec(E[xjaxc ]) = vec(E[xja: 0 x]). Again following the corollary of 
Isserlis’s theorem (Corollary A.2 there are 4 ways to partition the indices {ji,j2, J 3 } into pairs and 
singletons: ((ji), (j 2 , J 3 )), ((ji), (j' 2 ), (j 3 )), (O'l, J 2 ), (j' 3 )), ((j' 2 ), (ji, j 3 ), and they correspond to the 
four terms in the summation.) □ 


A.3 Two mixtures with same M 4 but different X 4 

Since M 4 gives linear observations on the symmetric low rank matrix X 4 . it is natural to wonder 
whether we can use matrix completion techniques to recover X 4 from M 4 . Here we show this is 
impossible by giving a counter example: there are two mixture of Gaussians that generates the 
same 4th moment M 4 , but has different X 4 (even the span of E'*'’ s are different). 

By ((a, b ), (c, d)) we denote a 5 X 5 matrix A which has 2’s on diagonals, and the only nonzero 
off-diagonal entries are A a ^ = Ab t a = A Ct d = Ad >c = 1- For example, ((1, 2), (4,5)) will be the 
following matrix: 

(21 \ 

1 2 

2 

2 1 

V 12 / 

where all the missing entries are 0’s. Now we construct two mixtures of 3 Gaussians, all with mean 0 
and weight 1/3. The covariance matrices are ((1,2), (4,5)), ((1,3), (2,5)), ((1,4), (3,5)) for the first 
mixture and ((1, 2), (3, 5)), ((1,3), (4, 5)), ((1,4), (2, 5)) for the second mixture. These are clearly 
different mixtures with different span of EW’s: in the first mixture, E)^ = £45 for all matrices, 
but this is not true for the second mixture. 

These two mixture of Gaussians have the same 4th moment M 4 . This can be checked by using 
Isserlis’ theorem to compute the moments. Intuitively, this is true because all the pairs (1 ,i) 
and (i, 5) appeared exactly twice in the covariance matrices for both mixtures; also, every 4-tuple 
(l,i,j, 5) appeared exactly once in the covariance matrices for both mixtures. 
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B Step 1: Span Finding 


Recall that in Step 1 of the algorithm for learning mixture of zero-mean Gaussians, we find the 
span of the covariance matrices in three small steps. In this section, we prove the correctness and 
the robustness of each step with smoothed analysis. 

For completeness we restate each substep and highlight the key properties we need, followed by 
the detailed proofs. 

B.l Step 1(a). Finding S, the span of a subset of columns of S^’s. 


Input: 4-th order moments A/ 4 , set of indices Ji. 

Output: span{Tj P : i E [k ], j E Ji}, represented by an orthonormal matrix S E R nx l^l fc . 

Let Q be a matrix of dimension n x \Ji\ 3 whose columns are all of ej 2 , ej 3 , /), for 

* 1 ,*2 , *3 G It. 

Compute the SVD of Q: Q = UDV T . 

Return: The first k\Ji\ left singular vectors S = [C7[ : ,i], ■ • •, £/[ : ,fc|w|]]- 

Algorithm 1: FindColumnSpan 


In Step 1 (a), for any set Ji of size y/n, we want to show that the one-dimensional slices of 
A /4 span the entire subspace S = span : i G [k\, j E Ji |, which is the span of a subset of the 

columns in the covariance matrices. 

Recall that in Claim O we showed: 


M±(ej 


■ o-E 


UJi 


yW y(®) 

' T-.R ":,R 


1 y(0 y(0 


I v(') vW 


Vjl,j2,j3 G N- 


2=1 


This in particular means when Ji,J 2 jJ 3 G Ji, the vector M^e^,, e j2 . e,j 3 , /) is in 5. We need to 
show that the columns of the matrix Q indeed span the entire subspace S. 

It is sufficient to show that a subset of the column span the entire subspace. Form a three-way 
even partition of the set Ji, i.e., {Ji^] = fJL^l = |"R^| = \Ji\/3 = y/n/3, and only consider the 
one-dimensional slices of A /4 corresponding to the indices j t E Ji^ for i = 1,2, 3. In particular, we 
define matrix Qs with these one-dimensional slices of A/ 4 : 


Qs 


[[M 4 ( ejl ,e j2 ,e j3 ,I) : j 3 E Ji^} : j 2 E Ji^] : ji E 7/ (1) 


£ ]g> n x(IW 3 ) 3 t 


( 12 ) 


Define matrix P$ with the corresponding columns of the covariance matrices, forming a basis 
(although not orthogonal) of the desired subspace S: 


Ps 


[Pg] :R [k]} :jen «] : Z = 1,2,3 




£ j^nxfc|H| 


(13) 


In the following two lemmas, we show that with high probability over the random perturbation, 
the column span of Qs is exactly equal to the column span of Ps, and robustly so. 


Lemma B.l (Lemma 5.3 restated). Given A/ 4 , the exact 4-th order moment of the p-smooth 
mixture of zero-mean Gaussians, for any subset Ji E [n] with cardinality \Ji\ = y/n, let Qs be the 


matrix defined as in (12) with the one-dimensional slices of A/ 4 . For any e > 0, and for some 
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k\n w \ fc|^ (2) | k\n {3) \ 

|H (2) ||H (3) | IT 


|H (1) ||H (2) ||^ (3) | 





Figure 3: Structure of the matrix B$ 

absolute constant Cd, 62,63 > 0, with probability at least 1 — ( Cie) C2n , the k\H\-th singular value 
of Qs is bounded below by: 

cr k \H\(Qs)>C 3 uj 0 e 2 p 2 n. (14) 

In order to prove this lemma, we first write Qs as the product of three matrices. 

the matrix Qs can be written 


Claim B.2 (Structural). Under the same assumptions of Lemma 
as 

T 


B.l 


Qs = Ps {Du ®kr I\H \) (PS), 


(15) 


where Ps € M nxfc ^ as defined in Equation (13 has columns equal to the columns in e! ! L; the 


diagonal matrix in the middle is the Kronecker product of two diagonal matrices and depends only 
on the mixing weights u>i’s. 

With the observation that the columns of Ps form a basis of the subspace S, and each column 
of Qs is a linear combination of the columns in Ps, the rows of Bs € Rd^l/ 3 ) xk \U\ can be viewed 
as the coefficients for the linear combinations, and has some special structures. In particular, it 
consists of three blocks: Bs = B^ 1 \b^ 2 \B^ . The first tall matrix B^ 6 M(l’^l/ 3 ) 3 xfc H’ w l/ 3 ) ) 

corresponding to the coefficient of the linear combinations on the subset of basis Ej. ^(i)y By the 
indexing order of the columns in Qs, the matrix L>U) is block diagonal with identical blocks equal 
to E^( 2 ) ^( 3 ), defined as follows: 


C ?d 2 ),7d 3 ) 


01 e W 2 | ,j 2 
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With some fixed and known row permutation -k^ 2 ) and the matrix B G) and B G) can be made 
block diagonal with identical blocks equal to E^p) ^( 3 ) and E W ( 1 )^( 2 ), respectively. Note that the 
three parts B^\ B^ 2 \ IT 3 ) do not have any common entry, nor do they involve any diagonal entry 
of the covariance matrices, therefore the three parts are independent when the covariances are 
randomly perturbed in the smoothed analysis. 

It is easier to understand the structure by picture, see Figure [3j The rows of the matrix should 
be indexed by (ji,£ L^ 1 ) X B^ 2 ' x B^ 3 \ which can also be interpreted as a cube (in the 
right). The block structure in the first part B G) correspond to a slice in B G) x B G) direction (for 
each block, the element in B G) is fixed, the elements in B G) and B G) take all possible values). 
Similarly for I?G) and £>G) (as shown in figure). 


Proof, (of Claim B.2| ) The proof of this claim is using Claim 5.1 the definition of matrices and the 
rule of matrix multiplication. Consider the column in Qs corresponding to the index (j \, ]2, j:i ) for 
ji E B^\j2 G B^ 2 \js E B^ 3 ', and the row of B$ together with the mixing wights specifies how this 
column is formed as a linear combination of 3k columns of P$. By the structure of IW 4 in Claim 5.1 
the (ji, J 2 , j 3 )-th row of B G) has exactly k entries corresponding to E^ ■ for j e [fc], these entries 


are multiplied bv un in the middle (diagonal) matrix. Therefore, these directly correspond to the k 
terms in Claim 5.1 Similarly the entries in B G) and B G) correspond to the other 2 k terms. □ 


Using Claim [B ~2 we need to bound the smallest singular value for each of the matrices in order 
to bound the k\B\-th. singular value of Qs, this is deferred to the end of this part. The most 


important tool is a corollary (Lemma G.16) of the random matrix result proved in Rudelson and 


Vershynin (2009), which gives a lowerbound on the smallest singular value of perturbed rectangular 


matrices. 

By Lemma 


B.l 


we know Qs has exactly rank k\B\, and is robust in the sense that its k\B\- 
th singular value is large (polynomial in the amount of perturbation p). By standard matrix 
perturbation theory, if we get Qs close to Q 5 up to a high accuracy (inverse polynomial in the 
relevant parameters), the top k\B\ singular vectors will span a subspace that is very close to the 
span of Qs- We formalize this in the following lemma. 


Lemma B.3 (Lemma 5.4 restated). Given the empirical estimator of the 4-th order moments 
M 4 = M 4 + E 4 . and suppose that the absolute value of entries of E± are at most 5. Let the columns 
of matrix S E be the left singular vector of Qs, and let S be the corresponding matrix 

obtained with M 4 . Conditioned on the high probability event <Jk\n\(Qs) > f or some absolute 
constant C we have: 


II Projg - Projg\\ < 


Cn 1 


.25 


a k\H\(Qs) 


- 6 . 


(16) 


Proof. Note that the columns of S are the leading left singular vectors of Qs- We apply the 
standard matrix perturbation bound of singular vectors. Recall that S is defined to be the first 
k\B\ left singular vector of Qs, and we have 

II Qs ~ Qs\\ < II Qs ~ Qs\\f < y/n{\ B\/3)H 2 . 


Therefore by Wedin’s Theorem (in particular the corollary Lemma G.5| ), we can conclude (16). □ 

Next, we prove Lemma B.l| 


23 



























B.l 


We first use Claim 


B.2 


to write Qs = Ps (A 


/ 1 ^|) (Bs) T , note that 


Proof of Lemma 

the matrix ( D% I\n \) has dimension k\H\ x k\H\, therefore we just need to show with high 

probability each of the three factor matrix has large &|'W|-th singular value, and that implies a 
bound on the k\H\-t\i singular value of Qg by union bound. The smallest singular value of Ps and 
Bs are bounded below by the following two Claims. 

Claim B.4. With high probability Uk\n\(Ps) > Sl(p\/n). 

Proof. This claim is easy as Ps € R nx *AI is a tall matrix with n > 5k\T-L\ rows. In particular, let 
P' s be the block of Pg with rows restricted to T~L C = [n]\H. Note that P' s is a linear projection 

the k\H\ singular values of P' s 


G.ll 


of Vs, and by basic property of singular values in Lemma 
provide lower bounds for the corresponding ones of Pg. We only consider the restricted rows so 
that P' s does not involve any diagonal elements of the covariance matrices, which are not randomly 
perturbed in our smoothed analysis framework. 

Now Pg is a randomly perturbed rectangular matrix, whose smallest singular value can be lower 
bounded using Lemma G.16, and we conclude that with probability at least 1 — (Ce)°' 25n , 


□ 


Vk\H\( p s) > epy/n. 

Next, we bound the smallest singular value of Bg- 
Claim B.5. With high probability a^fBs) > Q(p-^n). 

Proof. We make use of the special structure of the three blocks of Bg to lower bound its smallest 
singular value. 

First, we prove that the block diagonal matrix & 1 1 has large singular values, even after pro¬ 
jecting to the orthogonal subspace of the column span of B^ and B^\ This idea appeared several 


times in our proof and is abstracted in Lemma G.12 Apply the lemma and we have: 


a k \ H \(B s )>mmla k{2 \ H]/3) ([B^\B^]), <r fc (Proj ([5(2))5(3 ) ] xw( 2 )xw( 3 )) J- E «( a ),«(3)) : J e 


> min 


<7fe(2|«|/3)([£ (2) .£ (3) ])> <T*( Proj ([5(2) g(3) ] 




•h ( 1) j 

(17) 


where the j-th block of [A 2 \ A 3) ] has dimension (|77|/3 ) 2 x 2k\7i\/3. Since 

(\T~L\/3) 2 - k - 2k\H\/3 = ft(n/9 — k — 2kn os /3) > fI(n), 

this means for each block, even after projection it has more than 3k rows. Note that by definition 
the three blocks B^\ B l 2 ) and A 3 ) are independent and do not involve any diagonal elements 
of the covariance matrices, so each block after the two projections is again a rectangular random 


matrix. We can apply Lemma G.15 for any j, for some absolute constant C \, A, C 3 (not fixed 
throughout the discussion), with probability at least 1 — (Cie) C2n over the randomness of 2 ) -^( 3 ), 
we have: 


ak ( Pro i([_B( 2 ),b(3) ]{ , }xk(2)x n(3) p Proj ^ (2)iW(3) ^ 


(18) 


e « (1) 
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Now we can take a union bound over the blocks and conclude that with high probability, the 
smallest singular value of each block is large. 

In order to bound ^k(2\'H\/3)([ B ^ 2 \ we use the same strategy. Note that B l 2 ) also has a 

block structure that corresponds to the B^ x B^ faces (see Figure [3]). Again check the condition 
on dimension (\B\/3) 2 — k — k\B\/3 > f2(n) > 3k, we can apply Lemma G.12 again to show that 
for any j, with probability at least 1 — (Cie) C2n over the randomness of E^(i) ^(3), we have: 


x ± Proj v ± XL/m v(3)) : j £ B^}. 

«(DxO}x«(3)) h(D.w(3) h ’ n ’ J 

(19) 


« r *(2|W|/3)([5 (2) ,s (3) ]) > min{(7 fe( |^| / 3 ) (S (3) ), a fc (Proj {[S(3)] ^ (i 

for any j, with probability at least 1 — (Cje) C2?l over the randomness of 

S W (i) ^( 3 )) > epsJCxn. 


G.15 


Again by Lemma 
^7^(1) 7^(3)) we have: 

^( p r°j([s(3)] K(1) 


^ ± Proj v _L 


( 20 ) 


Finally, for B^ it is a block diagonal structure with blocks correspond to 77-'i x B^ faces (see 
Figure |3]). Each block is a perturbed rectangular matrix, therefore we apply Lemma G.15 to have 
that with high probability over the randomness of E-^(i) ^(2), 


( ^k(\'H\/3) (-B (3) ) > <Tjfc(£ W (i) >W (a)) > epy/n. 


( 21 ) 


Now plug in the lower bounds in (18) (20) (21) into the inequalities in ( |17[ ) and (19). By union 
bound we conclude that with high probability: 


°k\H\{ B s) > ep^Opn. 


□ 


Finally, the diagonal matrix in the middle is given by the Kronecker product of 1^ and D%. 
Recall that D% is the diagonal matrix with the mixing weights uVs on its diagonal. By property 
of Kronecker product and the assumption on the mixing weights, the smallest diagonal element of 
Du ®kr I\h\ i s least ojq. Therefore (Jk\n\{Du ®kr I\n\) — w o- 

We have shown that the smallest singular value of all the three factor matrices are large with 
high probability. Therefore, apply union bound, we conclude that with probability at least 1 — 
exp(—fl(n)), 

<7k\H\(Qs) > Vk\H\(Ps)Vk\H\( D u ®kr I\n\) a k\H\( B s) > 0{uj 0 p 2 n). 


B.2 Step 1 (b). Finding Us, the span of E^’s with columns projected to S^. 

In Step 1 (b), given the subset of indices B and the subspace S obtained in Step 1 (a), we want to 
show that the projected two-dimensional slices of M4 span the subspace Us defined in ([5]), which 
is the span of the covariance matrices with the columns projected the subspace S 


Us = span |vec(Proj5±sW) : i 6 [fc] j 


C 


Recall that in Claim 5.6 we characterized the two dimensional slices of the 4-th moments M4 
of mixture of zero-mean Gaussians as below: 

Mt( ejl ,e ja ,I,I) = ^ (E^ 2 E (i) + sjJ i] (S« a] ) T + S« a] (S« i] ) T ) , V),, j 2 € [n]. (22) 

i=l 
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Input: 4-th order moments M 4 , set of indices PL, subspace S C M n 

Output: span{ vec(Projg±X®) : i G [A:]}, represented by an orthonormal matrix 

U s eR n2xk . 

Let Q be a matrix whose columns are vec(Proj 5 xM 4 (ej, ej, I, /)) for all i, j G PL, i 7^ j. 
Compute the SVD of Q: Q = UDV T . 

Return: The first k left singular vectors Us = [L/[ : ,i]? • • ■, U[-.,k]\- 
Algorithm 2: FindProjectedSigmaSpan 


For notational convenience, we let J denote the set J = {(ji ,.72) : j 1 < j 2, ji,j 2 G %}, and 
note that the cardinality is |i7| = (^ +1 ) = ( n + \/n)/2. First, we define the matrix Qjj s G Mil'Ll 
whose columns are the vectorized two-dimensional slices of M 4 with the columns projected to the 
subspace 5^: 


Qu s = 


vec(Proj s xM 4 (e il ,e i2 ,/,/)) : G J 


(23) 


Similarly we define Qu 0 G M n2x l^l with the slices without the projection: 


Qu 0 = 


vec(M 4 (e il ,e i2 ,/,/)) : (ji, j 2 ) G J 


Observe the structure in ( 22 ) and we see the columns of Qu 0 is “almost” in the span of covariance 
matrices, except for some additive rank one terms. Note that all the rank one terms lie in the 
subspace S obtained from Step 1 (a), and they vanish if we project the slice to the orthogonal 

■ . . ~(i) ~ 

subspace S . In particular, Proj 5 x £j. L = 0 for all j G S. Let the columns of the matrix Pjj s G 
jgm 2 x k vectorized and projected covariance matrices as below: 


P U S = 


vec (Proj s iS«) : i G [k] 


(24) 


In the following claim, we show that the columns of Qu s indeed lie in the column span of Pjj s '■ 

Claim B.6. Given S obtained in Step 1(a), the span of for j G P and for all i, then for 
ji , j-i G %, we have: 

k 

Proj s ±M 4 (e jl ,e j 2 ,I,I) = ^uJi^^Proj^E^, Vji,j 2 G N- 

i =1 


Similar as in Step 1 (a), in the next lemma we show that the columns of Qu s indeed span the 
entire column span of Pij s ■ Since the dimension of the column span of Pu s is no larger than k, it 
is enough to the L-th singular value of Qu s : 


Lemma B.7 (Lenuna 5.8 restated). Given M 4 , the exact f-th order moment of the p-smooth 


mixture of Gaussians , define the matrix Qjj s as in (23) with the two-dimensional slices of M 4 . For 
any e > 0 , and for some absolute constant Ci,C 2 ,Cs > 0 , with probability at least 1 — 2 (Cie) C2n , 
the k-th singular value of Qjj s is bounded below by: 


°k(Qu s ) > C 3 uj 0 (ep) 2 7i 1 - 5 . 
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Similar as before, we first examine the structure of the matrix Qu s '■ 


Claim B.8 (Structural). Under the same assumption as Lemma BA, we can write Qu s 
following matrix product form: 


Qu s = -P[/ S -Cfc£j. 


( 25 ) 


The columns of the matrix Pij s E 


tn 2 xk 


are the vectorized and projected covariance matrices as 


defined in ( 24 ); D% is the diagonal matrix with the mixing weights Ui on its diagonal; and the 
matrix Yj is defined as: 


X, = 


vec ^Ui,j2) : C?i> Jz) e J\ ■ i e [k] 


xk 


Proof. This claim follows from Claim B.6, and the rule of matrix product. The coefficients cofY 


(0 

_ _ ji,h 

for the linear combinations of ve^Proj^uX^) are given by the columns of the product D^Y\. The 

coefficients are then multiplied by Pjj s to select the correct columns. □ 

To prove Lemma B.7[ similar to the proof ideas of Lemma B.l we lower bound the fc-th singular 
value of all the three factors. 


Proof of Lemma B.7 


T 


By the structural Claim B.8, we know the matrix Qu s can be written as 




a product of the three matrices as Qu s = Pu s ^n 

We lower bound the fc-th singular value of each of the three factors. It is easy for the last two 
matrices. Note that by assu mption ak{D~) > uj 0 , and since Xj is just a perturbed rectangular 
matrix, we can apply Lemma G.15 and with high probability we have a^Yj) > f l(py/n). 


The first matrix Pjj s is more subtle. Let us define the projection D s ± = Proj s _i_(g)j % r I n E M n Xn . 
This is just a way of saying “apply the projection Proj 5 _L to all columns” and then vectorize the 
matrix. In particular, for any matrix A we have D s ±vec(A) = ve^Proj^xH), therefore by definition 
of Pjj s we can write Pjj s = D S ±Y. 

However, we cannot apply the same trick to directly bound the smallest singular value of D s ± 
and Proj X separately. The problem here is that D s ± and X are not independent, as the subspace 

S obtained in Step 1(a) also depends on the perturbation on X, therefore Proj^ X is not simply 
a projected perturbed matrix. Instead, we show that even conditioned on the part of randomness 
that is common in S and X, X still has sufficient randomness due to the high dimensions, and we 
can still extract a tall random matrix out of it. This is elaborated in the following claim: 


Claim B.9. Under the assumptions of Lemma B. 7 with high probability the matrix Pu s = D S ±Y 
has smallest singular value at least Q(pn). 


Let L be the set of the (ji, j 2 )-th entries of XW for all i and one of ji, ji is in the set Ti. By Step 
1(a), the subspace S' = span( S, ej : j E TL) is only dependent on the entries in C. Here we need 
to include the span of e/s for j E hi because the diagonal entries can depend on the other random 
perturbations. By adding the span of the vector e^’s for j E 77 the subspace remains invariant no 
matter how the diagonal entries change. 

Let Z = span(X, S' <S>kr In), and recall that the columns of X are the factorization of the 
unperturbed covariance matrices. The subspace Z has dimension no larger than \TL\ (k + 1 )n + k < 
n 2 /10, and depends on the randomness of C. 

Let X = Y+E where E is the random perturbation matrix. Now we condition on the randomness 
in C. By definition the subspace Z is deterministic conditional on C. However, even if we only 
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consider entries of E\C there are still at least ( n > n 2 /4 independent random variables. We 

shall show the randomness is enough to guarantee that the smallest singular value of Proj/j E is 
lower bounded with high probability conditioned on £: 

Vk{Pu s ) = VkiDs^) 

> (^(Proj^xE) 

= u fc (Proj 2 xE + Proj^xFl) 

= cr fc (Proj^xPl). 


Here we used the fact that projection to a subspace cannot increase the singular values (Lemma G.ll). 

Conditioned on the randomness of entries in £, E\C still has at least n 2 /4 random directions, 
while t he dim ension of the deterministic subspace Z is at most n 2 /10. Therefore we can apply 
again to argue that conditionally, for every e > 0, with probability at least 1—(C'ie) c ' 2n 


Lemma 


G.15 


we have: 


^k(Pus) > e pVC 3 n 2 . 

In summary, apply union bound and we can conclude that with probability at least 1 — (Cie) c ' 2n , 

Vk(Qus) = Vk{Pu s )°k(Dz)(Jk{P‘j) > C^u 0 {ep) 2 n lb . 


□ 

Next, we again use matrix perturbation bounds to prove the robustness of this step, which 
depends on the singular value decomposition of the matrix Qu s . 

Lemma B.10 (Lemma 5.13 restated). Given the empirical 4-th order moments M 4 = M 4 + E 4 , 
and given the output Projg ± from Step 1 (a). Suppose that \\Proj§± — Proj^ ± \\ < 61 , and suppose 
that the absolute value of entries of E 4 are at most 62 for 62 < \\Qu s \\f / Vn? ■ Conditioned on the 
high probability event Pk{Qu s ) > 0 , we have: 

n 2 - 5 (1 + 281 / 82 ) 


I \ Pro iu s - P™iu s \\ ^ 


vkiQus) 


-82. 


(26) 


Proof of Lemma 


B.10 


Note that the columns of Us are the leading left singular vectors of Qu s - 
We want to apply the perturbatio n bou nd of singular vectors. 

Similar to the proof of Lemma 


B.3 


we first need to bound the spectral distance between Qu s 


and Qu s - I n feet we w iH even bound the Frobenius norm difference: 


II Qu s -Qu s \\f = \\D s ±Qu 0 - D s ±Q Uo \\ F 

= W^S^iQuo - Quo ) + (As-L - Ds^Quo + (As-L - Ds^iQuo ~ Qu 0 )\\f 

< II-^S-lIIfIIQc/o - Qu 0 \\f + 2||D 5 x - D s ±\\f\\Q Uo \\f 

< Vrf\\DMQu 0 — Qu 0 \\f + SV^IlProjgx — Projgx||iH|Q [/ 0 \\f 

< n\jn 2 \J\8\ + 2y / n-\/ n 2 \J\ ||Projg ± - Projgx||F 

< n 2 ^(! + 2||Projg x - Proj^ x || 2 /^ 2 )^ 2 , 


where we used the assumption ||SW|| < 1 to bound ||Qc/ 0 I|f) used the upperbound on \\Qu 0 — 
QuqWf tojsound the term || (D s ± - D s ±)(Q Uo - Qu 0 )\\f < ||(Asr-L - D s ±)\\ F 8 2 y/n 2 \J\ < ||(5 5 x - 
) II II 11^5 anc l used the fact that Frobenius norm is sub-multiplicative. Apply Wedin’s 
Theorem (in particular the corollary Lemma G.5), we can conclude (26). □ 
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Figure 4: Step 1(c): Merging two subspaces. 

B.3 Step 1 (c). Finding U by Merging the Two Projected Span 


Input: two subspaces S\, £2 G M” xfcs , two subspaces U\, U 2 G M. n2xk (the span of covariance 
matrices projected to the corresponding 

Output: spanfiE W : i G [A;]}, represented by an orthonormal matrix U G W 1 xk . 

Let A be the first 2 ks left singular vectors of [,Sj, S 2 ]. 

Let S 3 be the hrst (n — 2 ks) left singular vectors of I — AA T . 

Let Q = [I ri 2 ,Proj (5 3 8 )fer/?i) Proj r/l ] T L r 2 , compute the SVD of Q. 

Return: matrix U, whose columns are the hrst k left singular vectors Q. 

Algorithm 3: MergeProjections 

Pick two disjoint sets of indices TLi,H 2 , and repeat Step 1 (a) and Step 1 (b) on each of them 
to get Sj~ and Uj for j = 1, 2. In Step 1 (c), we merge the two span U\ and U2 to get U. 

If we are given two projections Proj 5 xC/ and Proj^x U of a matrix U, and if the union of the 
two subspaces and 5^ have full rank, namely dim(Si U ^ 2 ) = n, then we can recover U by: 


Pr °js^ 

t 

p roj s xt/ 

Pr °j sf 


p roj s xC/ 


However, it is slightly different if we are given two projections of a subspace U, since a subspace 
can be equivalently represented by different orthonormal basis up to linear transformation. 

In particular, in our setting for j = 1, 2, we can write Uj = (Proj s x ®kr 4 n )EILj for some fixed 

but unknown full rank matrix Wj (which makes the columns of matrix SW) an orthonormal basis 
of U). Recall that we define S = [vec(Sw) : i G [A:]], and D s ± = Proj^x < 8 >kr In for j = 1, 2. 

The following Lemma shows that we can still robustly recover the subspace U if the two pro¬ 
jections have sufficiently large overlapping. The basic idea is to use the overlapping part to align 
the two basis of the subspace which the two projections act on. 

Lemma B.ll (Robustly merging two projections of an unknown subspace). This is the detailed 
statement of Condition \5.10 

Let the columns of two fixed but unknown matrices V\ G M nxfc and V 2 G M. nxk form two basis 
(not necessarily orthonormal) of the same k-dimensional fixed but unknown subspace U in M n . 

For two s-dimensional known subspaces S\ and S 2 , Let the columns of A be the first 2s singular 
vectors of [Si, 62 ]; let the columns of S 3 correspond to the first (n — 2 s) singular vectors of 
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(I n — Proj A ), therefore S 3 C (Si U S 2 ) 1 ' ■ Suppose that a^Projg U ) > 0 and that 02 s ([<Si, S 2 ]) > 0. 
Define matrices U\ = Proj s ±V 1 and U2 = Proj s ±V 2 and we know that UjU\ = UjU2 = Ik- 

We are given Si, S2 and U\ , U2, and suppose that for j = 1 , 2, we have \ Sj — Sj||.f < 5 S and 
||Uj ~ Uj\\ F < S u , for S s < 1 ,S U < 1. 

Let the columns of A be the first 2s singular vectors of [Si, S 2 ], and let the columns of S 3 be the 
first ( n — 2s) singular vectors of (I n — Proj^). Define matrix U G M nx2fc to be: 


U = 


u 2 , u 1 (sju 1 y(sju 2 ) 


( 27 ) 


If crk(Proj s U) > 0 and 02s([Si, S2]) > 0 ? then for some absolute constant C we have: 


II Projff - Projjj || < 


CVk{5 u + 5 s /G2s{[Sl,S2\)) 

a k (Proj s U) 2 a 2s ([Si, S 2 }) 3 ' 


Proof. The proof will proceed in two steps, we first show that if we are given the exact inputs, 


namely 5 S = 5 U = 0 , then the column span of U defined in ( 27 ) is identical to the desired subspace 


U. Then we give a stability result using matrix perturbation bounds. 

1 . Solving the problem using exact inputs. 

Given the exact inputs Si, S2, U\ , U2 , first we show that under the conditions 02s([Si, S2]) > 0 
and <Tfc(Proj5 3 17) > 0, then the column span of the matrix [U2, U\(SjUiy (SjU2)] is indeed 
identical to U = span{V\) = spanfV 2). 


Claim B. 12 . Under the same assumptions of Lemma B.ll, given a matrix V G M fcxfe such that 


V = vjV2, let Proj Uo be the projection to the column span of Uq = [C/2, U\V], then we have 

Projuo = Proju- 

Proof. Given V = 1 //V2, then U\V = Proj^xViF = Proj 5 ±p2. Recall that by definition U2 = 
Pr°j s j_ V2, then the problem is now reduced to the simple problem of merging two projections 
(C/2 = Proj 5 .±V2 and U\V = Proj^xV^) of the same matrix (V^)- Therefore, to show that the 
columns of Uo = [C/2, U\ V] indeed span V2 and thus the desired subspace U, we only need to show 
that [Proj 5 x,Proj s x] has full column span. We show this by bounding the smallest singular value 
of it: 


cr n ([Proj s x, Proj 5 x]) >CT 2 . s ([Proj s x, Proj 5 x 


Si 0 

0 S 2 


) 


=<T2s([ (In - S 2 Sj )Si, (I n ~ SrfJ )S 2 ] ) 

Is -Sjs 2 
—Sj S\ 


=cr 2s ([ Si,S 2 ] 




s 


T 


=02s([ Sl,S 2 ] —Sf [ ]) 

=<T2s([ Si,S 2 ] [ Si,-S 2 ] T [ Si,-S 2 ]) 
=cr 2s ([Si, S 2 ]) 3 
> 0 , 


where the last inequality is by the assumption that 02s([Si, S2]) > 0. 


( 28 ) 

□ 


30 













Next, we show that in the exact case, the matrix V = V^V 2 can be computed by V = 
(SjU\ )HSjU 2 ) . The basic idea is to use the overlapping part of the two projections U\ and 
U 2 to align the two basis p and V 2 . Recall that by its construction, S 3 = (S 1 U = •S'i - H S 2l 

and Projg 3 = Proj 5 x nS x. Then for j = 1 and 2, we have: 

SjUj = SjProi s ±Vj = Sj (Proj s xProj 5 x + Proj 53 Proj s x)p = Sj (0 + Proj 53 )p = SjVj. 

J "J J J 

Moreover, since Uj = Projgxp is an orthonormal matrix, we have that all singular values of 
Vj are equal or greater than 1. Also note that U is an orthonormal matrix, so we have that 
upProj^p) > < 7 fc(Proj 5 3 £/) > 0. In other words, SjVj has full column rank k. Therefore, 

V = (sju 1 )\sju 2 ) 

= (sjv 1 )\sjv 2 ) 

= (p T s 3 sj p)-‘p T S 3 (S 3 T V 2 ) 

= (p T s 3 sj p)- 1 ^ s 3 sj pp f P 2 

= P1 f p 2 


where the third equality is the Moore-Penrose definition, the fourth equality is because Pi and V 2 
are basis of the same subspace, there exists some full rank matrix X £ M fcxfc such that V 2 = Pi A", 
so we have PP/P 2 = pp/pA = P X A = P 2 . 

2. Stability result. 

Given Si, S 2 and U \, U 2 which are close to the exact Si, S 2 , U\ and U 2 , we then need to bound 
the distance ||Proj^ — Proj^y||. This follows the standard perturbation analysis. In order to apply 
we need to bound the distance between \\U — Uq\\f, and lower bound the smallest 


G.5 


Lemma 

singular value of Uq. namely crpt/o)- Recall that we define Uq same as in (27) for the exact case 
with S s = 5 U = 0. 

First, we bound \\U—Uo\\f- Note that we can write Uq as Uq = U 2 B, where B = [I, LpS^C/i^Ss]" 1 " 
Recall that S 3 = (Si U Sa) -1 , apply Lemma 


G.5 


and we have: 


l|S 3 -S 3 || ^ H Pr °j 


S1US2 


- Pro .is, 


us^H - 


<v /g ll[gl,g2]-[5l,S2]||F < 


2 V 2 8 S 


V2s([Sl,S 2 }) 


(T2,([Si,S 2 ]Y 


Next, note that ||S 3 — S 3 || < 1 and \\Ui — U±\\ < 5 U < 1, apply Lemma G .6 

||S 3 t Pi - S 3 t Pi|| < 2(||S 3 - S 3 || + ||Pi - C/i||). 


we have: 


Next, note that crk(SjUi) = opProj^p) > 0 by assumption. Apply Lemma G. 8 , we have: 


||(S 3 T C/i)t-(S 3 T 7/i)t||< 


2y/2\\SjUi — S^ U\ 
cr fc (Proj S 3 Pi ) 2 


Next, apply Lemma |G. 6 | again we can bound the perturbation of matrix product: 


U-Uq || = \\U 2 B-U 2 B\\ 

<2(\\U 2 -U 2 \\ + \\B-B\\) 

= 2(|| U 2 - U 2 \\ + WUi&UtfS, - C/i(S 3 T Pi)tS 3 ||) 

< 2(||^2 - 17 2 || +4(||C/i - C/i|| + USjU.Y - (SjU i)t|| + ||S 3 - S 3 ||)). 
C(5 u + 6 3 /<T28([SuS 2 ])) 
a fc (Proj S 3 P ) 2 
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where C is some absolute constant, and the last inequality summarizes the previous three inequal¬ 
ities, and used the fact that <jfc(Projs 3 Vi) < 1 . Note that \\U — I7o||f < Vk\\U — Uo\\- 

We are left to bound <Jk(Uo). Recall that (Jk(V 2 ) > (C/ 2 ) = 1, and we have shown that in 

the exact case Uo = [Proj^xV^, ProjgxV^]- Then we can bound the smallest singular value of Uq 


following the inequality in (28): 


Cfc(T/o) ><r ri ([Proj S j.,Proj 5 x]) > cr 2 s ([Si, S 2 ]) 3 . 

Finally we can apply Lemma |G.5| to bound the distance between the projections by: 


l|Projp ~ Proj(/ 0 1| < 


V2\\U-U 0 \\ f ^ CVk(5 u + 6 s /a 2 s ([S 1 ,S 2 ])) 


CTk{U 0 ) 


< 


o"fc(Projs 3 Ri) 2 cT2 S ([5i, S 2 }) 3 ' 


□ 


In Step 1 (c), we are given the output U\ and U 2 from Step 1 (b), as well as the output S± 
and S 2 from Step 1 (a). Recall that U = span{ve c(sW) : i € [fc]}, and for j = 1,2, the matrix Uj 
given by Step 1 (b) corresponds to the subspace U projected to the subspace Bj = Sj~ ®kr In- 
Let matrix S 3 = S± n S^~ = (Si U S 2) -1 (obtained by taking the singular vectors of (/„ — AA T ), 
where A corresponds to the first 2k\T~i\ singular vectors of [Si,S 2 ]), and denote B 3 = S 3 <S>kr In- 
Define the matrix Qu to be: 


Qu = [ u 2 , U 1 (B 3 U 1 ^B3U 2 ) 

and similarly define the perturbed version Qu to be: 

Qu = \ U 2 , Ui(B 3 UiyB 3 U 2 ) 


(29) 


Now we want to apply Lemma 


B.ll 


to show that Projg = Proj^ 


and bound the distance 


||Projg^ —Proj^,||. In order to use the lemma, we first use smoothed analysis to show (in Lemma 


B.13 


and Lemma B.14|)that the conditions required by the lemma are all satisfied with high probability 


over the p-perturbation of the covariance matrices, then conclude the robustness of Step 1 (c) in 
Lemma IB. 151 

Lemma B.13. With high probability, for some constant C 

Vk(Projg Y>) > Cepn. 


Proof. This is in fact exactly the same as Claim |H~9 


Given £ = £ + E, by the definition of S3 and B3 we know that B3 only depends on the 
randomness of PjE for i = 1, 2, where 

J = {()i5 J2) : ji £ B\ U B 2 , or j\ e B\ U% 2 }, 

and Pj denotes the mapping that only keeps the coordinates corresponding to the set J . Therefore, 
we have: 

<Rc( p rojg 3 £) > cr fc (Proj (5 T S) xProj^ 3 F;). 

Note that the rank of B 3- is 2 nk\H\) and \ J\ = 2 n\H\, thus n 2 — \ J\ — 2 nk\H\ —k = Ll(n 2 ) > 2 k. So 
we can apply Lemma G.15| to conclude that for some absolute constants C\, C 2 , C3, with probability 
at least 1 — ( Cie) C2n , ak(BjH) > ep'/Cfn 2 . □ 
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Lemma B.14. With high probability, for some constant C, 

^2k\n\([Si,S 2 ]) > Cu o (ep) 2 nr 0 ' 25 . 

Proof. For i = 1,2, recall that S) is the singular vectors of Qs ,, where Qs^ is defined with the set 
PLi as in ( fl2| ). We can write the singular value decomposition of Qs t as Qs, = SjD,y^ for some 
diagonal matrix Di and orthonormal matrix Vi, and 

[Si, S 2 ] = [Qsi,Qs 2 \ 


Vl Df 1 _ 0 
0 V 2 Df x 


Note that we can write [Qsi, Qs 2 ] = [-Psi, Ps 2 ](di&g(Bg^, -Bg 2 )) T , and following almost exactly with 


the proof of Lemma 


B.l 


we can argue that, with probability at least 1 — (Cie) C2T 
° 2 k\H\{[Qs 2 ,Qs 2 ]) > Cui 0 {ep) 2 n. 


Moreover, by the structure of M 4 and the bounds on -< ^ I , we can bound HQsJI < 3^/n(|?f |/3) 3 , 
and thus: 


a k\H\(Vi D i X ) 


1 

-~— > 

&max ( QSi ) 


1 

3y/n{\U\/3)3 


n(n~ 1 - 25 ). 


Therefore, we can conclude that, for some absolute constant C, we have: 


°‘ 2 fc|-H|([-S'i, 5 ' 2 ]) > Cu 0 (ep) 2 n 0 25 . 


□ 


In the next lemma, we apply Lemma B.ll to show that under perturbation, with high probability 


the column span of Projg^ = Projg and this step is robust. 

Lemma B.15. Given the output Si, S 2 and U±, U 2 from Step 1 (a) and (b) based on the empirical 
moments M 4 . Suppose that for i = 1,2, || Si — S)||/7 < S s , || Ui — ?7j||jr < 5 U for 5 S ,5 U < 1. Let 
the columns of U G xk be the k leading singular vectors of Qp defined in (29). Then for some 
absolute constants C, with high probability, 


II Pr 0 Ju 


Projjj || < 


CVk(5 u + 6 s n 0 - 75 /(u> 0 e 2 p 2 )) 
w ^ e 8 ^ 8 ??. 1 - 25 


(30) 


Note that a 2k \ H \ n {{Bi, B 2 \) = cr 2k \ H \ ([Si, S 2 ]), and for i 
5 ) 11/7 < y/n5 s . Therefore, with the above two smoothed ana( 
of <T 2 fc|^|([Si, S 2 ]) and afc(Proj^ 3 (S)), the proof of Lemma 


= 1 , 2 , we have || Bi — Bi\\p < y/n\\Si — 
ysis L emmas showing polynomial bound 
B.15 follows by applying Lemma B.ll 


C Step 2. Unfolding the Moments 

In the second step of the algorithm, we solve two systems of linear equations to recover the unfolded 
moments. 
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Input: 4-th order moments M 4 G M n4 , 6 -th order moments Mq G R ne , the span of 
(vectorized with distinct entries) covariance matrices U G R n2Xfc . 

Output: Unfolded moments in the coordinate system of U: I 4 G Y% G Rgyjf xk 

Let Y 4 be the solution to min y4eK fcxfc || y/SF 4 (UY 4 U T ) — M 4 \\ 2 F . 

Let Yq be the solution to min ygg]R fcxfcxfc ||\/l5^ r 6^6 (^ T 5 U T , U T ) — Mq\\ 2 f . 

Return: Y 4 ,Yq. 

Algorithm 4: EstimateY^Yfj 


C.l Unfolding the 4-th Order Moments 


Recall the hrst system of linear equations 


M 4 = (Y 4 ). 


In the equation, Y 4 G is the unknown variable which can be viewed as a kxk symmetric matrix. 
Given U G R n2Xfc , the column span of E that we learned in Step 1, the first linear transformation 
X 4 is simply X A (Y 4 ) = UY 4 U T . It is supposed to transform Y 4 into the unfolded moments 

xn2 


X 4 G Wf 2 ™ 2 , which is defined to be ]U )=1 vj l vec(Y l -Y)vec(Y^) T . The next transformation yf?>F 4 
maps the unfolded moments X 4 to the folded moments M 4 G R n4 . As we showed in Lemma 3.7 
the mapping P 4 is a projection. 

Since U is the column span matrix of S, there must exist a Y 4 such that X 4 = YiD^Yj T = UY 4 U T 
(recall that D% is the diagonal matrix with entries u>i), so the system must have at least one solution. 

Rewrite the system of linear equations M 4 /y/ 3 = J -40 X A (Y 4 ) in the canonical form: M 4 y/ 3 = 
H 4 vec(Y 4 ) where the variable vec(Y 4 ) G and the coefficient matrix H 4 G M n4Xfc2 is a function 
of U and therefore also a function of the parameter E (recall n 4 = (") and ^2 = ( fc 2 1 )) - s y s t em 
has a unique solution if the smallest singular value of the coefficient matrix H 4 is greater than zero. 

The main theorem of this section shows that with high probability over the p-perturbation the 
system has a unique solution: 

Theorem C.l. With high probability over the p-perturbation ofY,, the smallest singular value of 
the coefficient matrix H 4 is lower bounded by a m i n (H 4 ) > Q(p 2 n/k). As a corollary, the system 
has a unique solution. 


In order to prove this theorem, we first need the following structural lemma: 

Lemma C.2. The coefficient matrix H 4 is equal to A 4 B 4 . The first matrix A 4 G M n4Xfc2 has 
columns indexed by pair {(i, j) : 1 < i < j < k}, and the ( i,j)-th column is equal to CijF 4 {vec{Y^)Q 
uec(E^)). Here Cij = 1 if i = j and C,;j = 2 if i < j. The second matrix B 4 G M fc2Xfc2 transforms 
a k x k symmetric matrices Y 4 into: 

B 4 vec(Y 4 ) = wec((E t t/)Y 4 (S t ?7) T ). 

Next we need to prove the bounds on the smallest singular values for A 4 and B 4 . The first matrix 
A 4 is essentially a projection of the Kronecker product (E <8 >kr El)- In particular, this projection 
satisfy the “symmetric off-diagonal” property defined below: 

Definition C.3 (symmetric off-diagonal). Let the columns of matrix P G M n 2 xrf2 form an (arbi¬ 
trary) basis of the subspace V, and index the rows of P by pair ( i,j) G [ 712 ] x [ 712 ]. The subspace V 
and the matrix P is called symmetric off-diagonal, if (i,i)-th row of P is 0 (“off-diagonal”), and 
the ( i,j)-th row and ( j,i)-th row are identical ('‘symmetric”). 
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Remark C.4. Since symmetric off-diagonal is a property on the structure of rows of the basis 
P. If one basis of the subspace V is symmetric off-diagonal, then any basis is too. Moreover, any 
orthogonal basis of the subspace V will still be symmetric off-diagonal. 

Consider a Kronecker product of the same matrix E G M. n2Xk . The columns of E (g>fc r E are 
indexed by pair (i,j) G [k] x [A;]. Consider applying a symmetric off-diagonal projection P T to the 
Kronecker product. By the property of symmetry the projection will map two columns E[. 
and Er.j] © Ei.a to the same vector. Therefore the projected Kronecker product P T (E ©fc r E) 
will not have full column rank k 2 . However, we will show that the &2 “unique” columns after the 
projection are linearly independent. 

To formalize this, we define the matrix (. E E)uniq £ M n 2 xfcz with the “unique” columns of 

E ®kr E labeled by pairs {(*, j) : 1 < i < j < k}. In particular, 


[{E ©fcr E) un iq\ [ : ,(ij)] — E [-I] © 

In the following main lemma, we show even after projection to any symmetric off-diagonal space 
with sufficiently many dimensions, the “unique” columns of a Kronecker product of random matrices 
still has good condition number. 

Lemma C.5. Let E G M. n2Xk be a Gaussian random matrix (each entry distributed as J\f (0,1)). 
Let P G M n 2 xd2 be a symmetric off-diagonal subspace of dimension ofo = D(n 2 ). Then for any 
constant C > 0, when ri 2 > k 2+c we have with high probability a m i n (P T (E® kr E) uniq ) > H(n 2 ). 


Let us first see how Theorem C.l follows from the two lemmas (Lemma C.2 and Lemma C.5 


Proof, (of Theorem C.l) Using the structural Lemma C.2, we know we only need to bound the 


smallest singular value of A 4 and B 4 separately. The following two claims directly imply the 
theorem. 

Claim C.6. cr m j n (A 4 ) > Ll(p 2 n 2 ). 

Claim C.7. a min (P 4 ) > 1/(4||E|| 2 ) > 1/(4 nk). 

Next we prove the two claims. 


We apply Lemma C.5 to prove Claim 


C .6 


Note that the p-perturbed covariances E is not a 
random Gaussian matrix, yet it is equal to the unperturbed matrix E plus a random Gaussian 
matrix Ey, = pl^\ Since we consider arbitrary E, the columns of E as well as the columns A 4 may 
not be incoherent. 

Instead, we project A 4 to a subspace to strip away the terms involving the original matrix E. 
Let S be the range space corresponding to the projection P 4 . Recall that |Sj = = ^(n 2 ), and 

by the definition of P 4 , S is symmetric off-diagonal. Define the subspace S' = span(S'- L ,E ©fc r 
In 2 >In 2 ®kr £). Let P = (S'') 2 -. By construction |P| > |5| — 2kn^ = fl(? r 2 ). Also, since P = (S') 1 - 
is a subspace of S, it must also be symmetric off-diagonal (see Remark C.4). After projecting A 4 


to P, we know that the (i,j )-th column (1 < i < j < k) of P T A 4 is given by: 


pT iMb,(iJ)] = C M P ' ( S [:,*l © S [:J] + P E \:,i] © S [: j] + pEj.^] © E [:J] + p Z E [:>i] © E [:J] ) 




= C i , j p 2 P T E [ , i] © E[.j], 


'Note that the diagonal entries are then arbitrarily perturbed, but we will project on a symmetric off-diagonal 
subspace so changes on diagonal entries do not change the result. 
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Thus in P T A 4 all the terms involving £ disappears. Therefore 


© &min(P ^4) — < ^min{.P ©fcr E)uniq) — P &min(,P (P ®kr P^uniq) ^ Tl 2) 


jT 


T/ 


where the first inequality is because the smallest singular value cannot become larger after projec¬ 
tion, the first equality is by definition, the second equality is by the property of P, and the final 


step uses Lem ma C.l 
For Claim 


C.7 


Pick any Y 4 E we have 


\\B 4 (Y 4 )\\ = ||vec((£ t t/)y 4 (£ t K) T )|| = \\{tfU)Y 4 {tfU) T \\ F > \\Y 4 \\ F amin(tfU) 2 = ||y 4 || F /||£| 


where the inequality is because ||AB||jr > a m i n (A)\\B\\F if A E M mxn and m > n. Since ||vec(l 4 )|| 
is within a factor of \/2 to ||l 4 ||i^, and by the assumption £ 0 -< we can bound ||£|| < ^l(\fnk), 
we have the desired bound for a m i n (B 4 ). □ 


Structure of the Coefficient Matrix 


In this part we prove the structural Lemma C.2 


First, assume we know the true £ matrix, then in order to get the unfolded 
moments X4, we only need to solve the equation J r 4 (£D 4 £ T ) = M 4 with the k X k symmetric 
variable D4, and the solution should be equal to the diagonal matrix D%- 

However, we only know U which is the column span of £, so we can only use UY 4 U T and 
let L r l 4 L rT = £Z? 4 £ T . Note that there is a one-to-one correspondence between 1 4 and D4. In 
particular we know D4 = (£ff7)l 4 (£tf7) T , this is exactly the second part B4. 

Now the first matrix A 4 should map vec(L> 4 ) to M 4 . By construction, the (z,j)-th column 
(* < j ) of A 4 is equal to ,F 4 (£« © £^ + £^ © £«) = 2J r 4 (£« © £^), since P 4 is symmetric 
off-diagonal we know P 4 (v 4 © V 2 ) = P 4 {v 2 © v\) for any two vectors v\, V 2 ■ For the (z, z)-th column, 
by construction they are equal to y 4 (£W © £®) as we wanted. □ 


Proof, (of Lemma C.2 


Main Lemma on Projection of Kronecker Product In this part we prove Lemma |C. 5 

The singular values of Kronecker Product between two matrices are well-understood: they are 
just the products of the singular values of the two matrices. Therefore, the Kronecker product of 
two rank k matrices will have rank k 2 . However, in our case the problem becomes more complicated 
because we only look at a projection of the resulting matrix. The projected Kronecker product may 
no longer have rank k 2 because of symmetry. Here we are able to show that even with projection 
to a low dimensional space, the rank of the new matrix is still as large as . 

The basic idea of the proof is to consider the inner-products between columns, and show that 
the columns are incoherent even after projection. 


Proof, (of Lemma C.5) Consider the matrix ( E P)uniqPP T (P ©fcr E) 


uniqi 


we shall show the 


matrix is diagonally dominant and hence its smallest singular value must be large. In order to do 
that we need to prove the following two claims: 


Claim C.8. For any i, j < k, i < j. with high probability \\P T ©£[©]) IP > ^(nl). 
Claim C.9. For any i. j < k, i < j, with high probability 


X] I (P T ( E [:,i\ 0 E [:,j])’ P T ( E [:,i'] 0 I - 0 ( n !)- 

l<i'<j'<k,(i,j)^(i' ,j’) 


“Note that although diagonal entries are not perturbed, we also have Pum = 0 so we can still apply the lemma. 


36 


















With this two claims, we can apply Gershgorin’s Disk Theorem G.9| to conclude that cr m i n ((E<S>kr 
E )Zniq PpT i E ®kr E )uniq) > D(n|). There fore < 7 m m(P T (E ©fc r E )uniq) > D(n 2 ). 

Now we prove the two claims. For Claim C.8 it essentially says the projection of a random vector 
to a fixed subspace should have large norm. If the vector has independent entries, this is first shown 
in Tao and Vu (2006). Recently Vu and Wang (2013) generalized the result to K-concentrated 

\E\. 


vectors, see Lemma 


G.18 


By Lemma 


G.19 


we know conditioned on ||P[.^]||, \\E[. j}\\ < 2^/fH, 


(E[ : i j 0 E[.j]) Pt q(p / q) is 0(-^/n2)-concentrated. By assumption P ignores all the (E^^ © P[ : j]) P jP 
entries. Therefore Pr[|||P T (£[. ,] 0 P).^) || 2 — d 2 | > 2ty/d^ + t 2 ] < Ce _Q 0 2 / n2 ) + e ~tt(n 2 ) yy e then 
pick t = y/du/S > D(n 2 ), which implies Pr[||P(£ , [. j] ©P[.j])|| 2 < d 2 /2] < Ce~^ n2 \ This is what 
we need for Claim [C~8l 

For Claim C.9i we need to bound terms of the form (P T (E\.^ P T {E © E [:,j’]))- 


These are degree-4 Gaussian chaoses and are well-studied in Latala et al. ([2006). 

We break the terms according to how many of i',j' appears in i,j. 

Case 1: i',j' 0 {z, j}. In this case we first randomly pick E [ : j], and condition on the high 
probability event that ||P[ :i j]||, ||P[ :J -]|| < 2y / n 2 . In this case the inner-product can be rewritten as 
{PP T (E[ :! i] © E [.,j])i ( E [: t i'} © E '[:,/]))) and we know ||PP T (P[.,j] © P[ : j])|| < 4n 2 . Also, since P is 
symmetric off-diagonal we know in this degree-2 Gaussian chaos (only E^.y\ a nd E[ : j >] are random 
now) there are no “diagonal” terms. Therefore the Deco upling Theorem |G. 23 shows without loss of 
generality we can assume i! ^ j'. Apply Theorem G.21 we know this term is bounded by 0(n\ +e ) 
with high probability for any e > 0. 

Case 2: One of i', j' is in Without loss of generality assume i' E {i,j} (the other case is sym¬ 

metric). Again we first randomly pick P[ ;j j],P[.j] and condition on the high probability event that 
||P[ ;) j]||, ||P[.j]|| < 2 v /n 2 (but this will also determine P^j/j). After the conditioning, only E[ : ji] is 
still random, and the inner-product can be rewritten as ^mat(PP T (P[.j] © E^^)E^.^, P[ : j/]) where 

the fixed vector mat(PP T (P[ : i ] © E[ : j] ))P[ : y] has norm bounded by ||PP T (P[. ,i]® E [:,j])\\\\ E [:,i']\\ ^ 

3/2 3/2_i_ e 

8 n 2 ■ By property of Gaussian with high probability the inner-product is bounded by 0(n 2 ) 

for any e > 0. 

Case 3: i', j' E {i, j}. Since i',j' cannot be equal to ?, j, there is only one possibility: i',j' are both 
equal to one of i,j and i / j. Without loss of generality assume i! = j’ = i / j. We can swap i,j 
with i',j’ and this actually becomes Case 2. By the same argument we know this term is bounded 
by 0 (ruj 2+e ) for any e > 0. 

There are 0(k 2 ) terms in Case 1, 0(k ) terms in Case 2 and 0(1) terms in Case 3. Therefore by 
union bound we know the sum is bounded by 0(kruj 2+c + k 2 n^ +e ) with high probability. Recall we 
are assuming n 2 > k 2+c (which only requires n > k 1+c//2 ). Choose e to be a small enough constant 
depending on C gives the result. □ 


C.2 Unfolding 6-th Order Moments 

Recall the second system of linear equations is 

Mg/V15 = P 6 o X%(Yq). 

In the equation, Yq E is the unknown variable which can be viewed as a k X k X k 

symmetric tensro. The first linear transformation X§ transforms Yq into the unfolded moments 
Xq E M.™ 2 £ l n2Xn2 , which is supposed to be equal to Y^l=i frjvec(SW)© 3 . The transformation is 
simply Xq = Xq (Yq) = Y/i(U T , U T , U T ) where U E W l2Xk is the column span of S that we learned 
in the previous section. 
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The next transformation J-q maps the unfolded moments Xq to the folded moments Mg G 

is a projection. Recall that n§ = (g) 


which as we showed in Lemma 


3.7 


Rewrite the system of linear equations Mg/V15 = J-qoXq (Y§) in the canonical form: Mg/\/15 = 
HQvec(Yia) where the coefficient matrix Hq G R ” 6 x j s a function of U and therefore is a function 
of S (recall k 3 = ( fc + 2 )). 

The second system of linear equations tries to unfold the 6 -th order moment Mq to get Yq. 


Similar to Theorem C.l the following theorem guarantees that with high probability over the 
perturbation the system has a unique solution. 

Theorem C.10. With high probability over the perturbation, the coefficient matrix Hq has smallest 
singular value a m in(He) > n(p 3 (n/fc) 1 ' 5 ). As a corollary, the system has a unique solution. 


The proof of this theorem is very similar to the proof of Theorem |C.l| Here we list the important 
steps and highlight the differences. 


As before the theorem relies on a structural lemma (Lemma C.ll), and a main lemma about the 


symmetric off-diagonal projection of a Kronecker product of three identical matrices (Lemma C.13). 


Lemma C.ll. The coefficient matrix Hq is equal to AqBq. The first matrix Aq G R n6Xfc3 has 
columns indexed by triples (21,22,23) for 1 < ii < *2 < *3 < k, and are given by: 


[MunM,i 3 )) = C ilii2ii3 X 6 (vec(Y^) 0 vec(Y <*>) © vec(Y^)), 


where Ci lt i 2 ,i 3 is a constant depending only on multiplicity of the indices ( 21 , 22 , 23 ). The second 
matrix B$ G M fc3Xfcs transforms a k x k x k symmetric tensor Yq into: 

b 6 (y 6 ) = y 6 ((s t c/) T , (st u) T , (st u) T ). 


Before stating the main lemma, we update the definition of symmetric off-diagonal subspace. 

Definition C.12. Let the columns of matrix P G M n 2 xc k form a basis of a subspace V. Index 
the rows of P by triples (21,22,23) £ [222] x [77.2] x [712]. The matrix P and the subspace V are 
called symmetric off-diagonal if: whenever 21 , 22,23 are not distinct the corresponding row is 0 
(“off-diagonal”); and for any permutation it over { 1,2,3}, the rows corresponding to (ii,i 2 >* 3 ) and 
(LiR): hr( 2 )>*tt( 3 )) are identical (“symmetric”). 


It is easy to verify that since the moments in Mq all have indices corresponding to distinct 
variables, the projection J-§ is indeed symmetric off-diagonal. The constraints in this definition is 
closely related to the decoupling Theorem G.23 of Gaussian chaoses. 

Similarly, we define the “unique” columns in the 3-way Kronecker product to be the matrix 
(_£/ 0 ^ 7 . E E^j un iq 
and (E E L l )unig)[:,(j ll j 2 ,j 3 )] = 


l n 2 xk 3 w hose columns are labeled by triples ( 21 , 22 , 23 ) : 1 < T < 22 < 23 < k, 


Lemma C.13. Let E G M n2Xfc be a Gaussian random matrix. Let P G M n 2 xc b be a symmetric 
off-diagonal subspace of dimension d 3 > Ilfri'f). For any constant C > 0, if > k 2+c , with high 
probability a min (P T (E ® kr E ® kr E) uniq ) > Vt(ru/ 2 ). 


The proofs of Theorem C.10|are based on the above two lemmas. The proof of Lemma C.ll 


is 


essentially the same as Lemma C.2 The proof of Lemma C.13 is very similar to that of Lemma C.5 
and we highlight the only different case below: 
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Proof, (of Lemma C. 13 ) 


As before we try to prove that the columns of P T (E E ®kr E) un i q are incoherent. Recall 
we needed the following two claims: 

Claim C.14. For any 1 < i\ < *2 < 23 < k, with high probability || P T qe [:M qe [:M )W 2 > 
tt(n 3 2 ). 

Claim C.15. For any 1 < i\ < *2 < *3 < k, with high probability 


£ 

1^2^ ^^2 —^3 ?(^1 ?^2 5^2 5^3) 


P 


(-£[:,il] © %i 2 ] © ^[:,i 3 ])> P 0 %*!,] ® ^[:,^])/ ^ °( n 2) 


The first claim can still be proved by the projection Lemma G. 18 , except the vector Eu^ j 0 


© 1 ?[. j 3 ] is now 0(n2)-concentrated (the proof is an immediate generalization of Lemma G. 19 ). 

The second claim can be proved using similar ideas, however there is one new case. We again 
separate the terms according to the number of i\. i ’ 2 , *3 that do not appear in {*1,22,23}- 
Case 1: At least one of i\ , i 2 ■ *3 does not appear in {*i, * 2 , *3}. Suppose there are t of i\ , i' 2 , i' :i that 
do not appear in {*i, *2, *3}, similar to before we first sample E ^, Aj 2 , Ei 3 and condition on the event 
that they all have norm at most 2y / r*2. The inner-product then becomes an order t Gaussian chaos 
with Frobenius norm n, ^ 2 . By Theorem G .23 and Theorem G .21 we know with high probability 


all these terms are bounded by n 2 *© +c for any constant e > 0. 

Case 2: All of i \, i' 2 , *3 appear in {*1,22,23}- In the previous proof (of Lemma C. 5 ), there was 


only one possibility and it reduces to Case 1 . However for 6-th moment we have a new case: 
2 = *i = *2 = i'i < *2 = *3 = *3 = j (and the symmetric case i± = i\ = i' 2 < *2 = *3 = *3)- 
For this we will treat T = PP T as a 6-th order tensor with Frobenius norm at most rnj 2 (as a 



Finally we take the sum over all terms and choose e to be small enough (depending on C ), then 
when k 2+c < r*2 the sum is a lower-order term. □ 


C.3 Stability Bounds 

For the two linear equation systems in ([ 7 ]), we can write them in canonical form with coefficient 
matrices and the unknown variable vec(l4), vec(Yg), corresponding to the ^2,^3 distinct 

elements in symmetric I4, Yg, namely: 

Fh^veciYi) = M4/V3, iLgvec(Yg) = Mg/\/X 5 . 

When the empirical moment estimations for are used throughout the algorithm, 

both the coefficient matrices H 4, Hq and the constant terms M 4, Mg are affected by the noise from 

9 The notation might be confusing here: T(Ey.^y, Ey.^y, I, 7, 7) is a 3rd order tensor, and we are applying it to 
The whole expression is equal to T(E\.n, Eun, Er.ji, Tfo;], Ei. ji, Er.ji). 
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empirical estimation. In practice, instead of solving systems of linear equations, we solve the least 
square problem: 


min \\V3F 4 (UY 4 U t ) - M 4 f, min \\Vy>EqYq(U t ,U T ,U T ) - M 6 \\ 2 . (31) 

\r A ^ X /c -yr t-mk X k X k 

^4tM S y m 16^u&sym 

and the solution to the least square problems are given by: vec(> 4 ) = h\m± and vec(le) = H^Mq. 

Lemma C.16. Given the empirical f-th and 6 -th order moments M 4 = M 4 + E 4 , = Mq + Eq, 

and suppose that the absolute value of entries in E .4 and Eq are at most <5i. Let U be the output 
of Step 1 for the span of the covariance matrices, and suppose that \\U — U\\ < 5 2 - Suppose that 
<5i < mm.{\\M 4 \\F/y/nf,\\MQ\\F/, and 82 < min{l, crfc 2 (IL 4 )/2, ak 3 (H/)/2}. Then, conditioned 
on the high probability event that both Ufc 2 (H 4 ), dk 3 (Eq) are bounded below, we have: 

»n-n» f <o(( {l + -^ i )^). 
lu-ye»,<o((j 1 + -|^)vs 8 ). 

Proof. We write the proof for Y 4 , the proof for Yq is exactly the same except changing the subscripts. 

Recall that the coefficient matrix H 4 corresponds to the composition of two linear mappings 
p 4 (UY 4 U T ) on the variable Y 4 . Since we have showed that +4 is a projection determined by the 
Isserlis’ Theorem and independent of the empirical estimation of the moments, we can bound the 
perturbation on the coefficient matrices by: 

\\H 4 -H 4 W < \\UQ 2 -UQ 2 \\<2\\U-U\\\\U\\ + \\U-Ug< 38 2 < \\H 4 \l 


Similarly, we have \\Hq — Hq\\ < \\U © 3 —U © 3 || < 78 2 < ||i?6||- 

Therefore we can analyze the stability of the solution to the least square problems in (31) as 
follows: 


|vec(T 4 ) - vec(y 4 )|| = H 4 M 4 - H\M a 


ft; 


< o(II^III|m 4 - m 4 || +1| h\ - h\\\ ||m 4 ||) 

< o(||m 4 - M 4 || +11 if] - h\ iiv^I) 

< 0(^^1 + 11 ^ 11111 ^ 111 ^)) 

1 


< O 4 / 714 ( 6 1 + 


°k 2 (H< 1) 2 


62) 


where the first inequality is by applying Lemma G.6 and note that ||(M 4 — < <5i y/nf < 


||M 4 ||i7, the second inequality is because ||M 4 ||f < 0 (y/n~ 4 ), the third inequality is by applying 
the perturbation bound of pseudo-inverse in Theorem |G.7[ the fourth inequality is by the assump¬ 
tion that b 2 is sufficiently small compared to the smallest singular value of El 4 thus Ofc 2 (IL 4 ) = 
0(a k 2 (H4))- 

□ 
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Input: the span of covariance matrices U £ M naXfc (vectorized with distinct entries), the 
unfolded 4-th and 6 -th moments Y\ £ M fcxfc and Yq £ 1 ^kxkxk j n ^he coordinate system of U. 
Output: Parameters Q = {(a;*, EW) : i £ [A:]}. 

Compute the SVD of Y 4 : Y4 = \ / 2 A 2 V 2 r . 

Let G = Y 6 (V 2 A' 1/2 , V 2 A ~ 1/2 , V 2 A~ 1/2 ) 

Find the (unique) first k orthogonal eigenvectors Vi and the corresponding eigenvalues A i 
of G , denoted by {(vi, \f) : i £ [A]} 

For all i £ [k], let vec(SW) = A iUV 2 A 2 G Vi, let Ui = (Aj) -2 . 

Return: Q = {(wj,sW) : i £ [k]}. 

Algorithm 5: Tensoi'Decomp 


D Step 3: Tensor Decomposition 


Given the estimations of the unfolded moments Y 4 and Yq from Step 2, and given the span of 
covariance matrices U from Step 1, Step 3 use tensor decomposition to robustly find the parameters 
of the mixture of zero-mean Gaussians. 

Recall that in the coordinate system with basis U, the covariance matrices (vectorized with 
distinct entries) are given by = Ua^ for all i. The unfolded moments in the same coordinate 
system are: 

k k 

Y 4 = ^2 <g> 3 . 

2=1 2=1 


We will apply tensor decomposition algorithm to find the crW’s. We restate the theorem for or¬ 


thogonal symmetric tensor decomposition in Anandkumar et al. Anandkumar et al. (2014) below: 


Consider k orthonormal vector 
k ' '• ^ 3 . Given 


Theorem D.l (Theorem 5.1 in Anandkumar et al. (2014)) 

v\,...Vk £ M n ’s and k positive weights Ai,...A k- Define the tensor T = Y11=i^i v ' 

T = T + E and assume that ||I£|| < C\ minjAjj/A:, then there is an algorithm that finds A i’s and 
Vi’s in polynomial running time with the following guarantee: with probability at least 1 — e~ n , for 
some permutation n over [A] and for all i £ [k], we have: 


\vi ~vi\\< 0(\\E\\/Xi), \Xi - Aj| < 0(||S||). 


In order to reduce our problem to the orthogonal tensor decomposition so that the tensor 
power method (Algorithm 1 , page 21 in Anandkumar et al. (2014)) can be applied, we use the 
same “whitening” 


technique as in Anandkumar et al. (2014) 
unfolded 4-th moments I 4 = V 2 A 2 V 2 


We first compute the SVD of the 


then use the singular vectors to transform the unfolded 6 -th 
moments Yq into an orthogonal symmetric tensor Yq (y 2 A 2 ,V 2 A 2 ,V 2 A 2 ). 

Next we complete the stability analysis for the two-step procedure, i.e. whitening and orthogonal 
tensor decomposition, which was not analyzed in Anandkumar et al. (2014). 


Theorem D.2. Consider k linearly independent vectors ai,...,afc £ W 1 , and k positive weights 
u 1 , • • • ,Wfc. Define G 2 = Yli=l u i a i and G 3 = Yli=\ u i a i ® a i® a i e K S/ X m Xn - Let 
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Train — m m{a mrn {G 2 ), 1 }, 7max — &r, 
that: 


~(G2), and let u Q = min{wj}. Given G2,G 3 and assume 


02 - G 2 \\f <S 2 <o , ||Ga - G 3 \\f <8 3 <o 


7, 


1.5 


k 


There exists an algorithm that finds cii and Qi in polynomial (in variables (n, k. I/cr m i n (G2))) run¬ 
ning time with the following guarantee: with probability at least 1 — e~ n , for some permutation n 
over [ k ] and for all i G [A:] we have: 

I \a n (i) - «7r(i)|| < poly{\\G 3 \\,l/a m in(G 2 ), 1/uj 0 )S 2 +poly(\\G 3 \\,l/a m i n (G 2 ), l/u 0 )S 3 , 

II & - Wj|| < poly(\\G 3 \\,l/a min (G 2 ))S 2 + poly(\\G 3 \\,l/<Jmin(G2))5 3 . 


Proof, (to Theorem D. 2 ) 

1. Algorithm 

We first apply the whitening technique in 


Anandkumar et al. 


— 172 

singular value decomposition of G 2, and note that the matrix V^Ag whitens G2 in the sense that 

- —' 1 /o _1 /o _1 /o 

G ? 2(1 / 2A 2 , V2A2 ' ) = I n . Similarly we can whiten G 3 with the matrix V2A2 and obtain the 

following symmetric 3 -rd order tensor G G 

G = g 3 (d 2 a- 1/2 , v 2 T 2 l/2 , f 2 a 2 “ 1/2 ). 

Note th at in the exact case with G 2 and G 3 , we have that: 


( 2014 ): Let G2 = V2A2V2 be the 


g = \iVi® 3 , 


i= 1 


where A* = w- 1 ^ 2 , and the vectors v t = \ i 1 f2 T A 2 and they are orthonormal. Also note that 

_ -y /2 

A mi n > 1 and A mnt < w n . We can then apply orthogonal tensor decomposition (Algorithm 1 in 


Anandkumar et al. 


( 2014 )) to G to robustly obtain estimations of vfs and Aj’s. After obtaining 
the estimation and Xfs, we can further obtain the estimation of afs and w,’s as: 


_ — — 1/2 _- — ^ 

ai = V 2 A 2 ' ViXi, Qi = (A i) 


-2 


( 32 ) 


2 . Stability analysis 

The estimation of the vectors and weights are given in ( 32 ). In order to bound the distance 


|a* — di|| and ||cD* — Wj||, we show the stability of the estimation V2, A 2 , an d Vj. Xi separately^ 


First, note that by assumption ||G 2 — G 2 ||_f < 82, we can apply Lemma 
bound the singular values and the singular vectors of G 2 by: 


G .2 


and Lemma 


G .3 


to 


||^2 — V2H < V 2 & 2 /imim ||A 2 — A 2 || < 5 2 . 


_/ 2 

Define X = V2A 2 ' and define Ax = X — X. By the assumption that 82 < o^Tmin), we ha ve 


HV2 - V2II < 1 and ||Aj 
bound ||Ax||: 


-1/2 


-A- 1/2 || < ||A 


-1/2, 


-1/2 


IIAx|| < 0(||V2 — V 2 IIHA 2 ' || + ||"^ 2 1|||A. 
<o( ^ 7 - 1/2 + h~ 1/2 ) 2 do 

— I ryl >mm ' V I mm / ^ 

\ ' i 


2 11 < 7 min ■ Therefore we can apply Lemma 

_1 / 2 II _L IITV.II II A " -1 / 2 


G.6 


to 


-A 


—1/2. 


< 0 ( 82 /% 


mm 

1.5 \ 

min' J 
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Moreover, since S 2 < o( 7 min ), we also have ||Ax|| < ||A"|| = 7 ~° n 5 . 

Next, we bound the distance ||G — G||. Recall that G = G 3 (X,X,X). Using the fact that 
tensor is a multi-linear operator, and by the assumption that ||G ' 3 — <S', 3 11 < < 5 , 3 , we have: 


e=\\G-G\\ < \\G 3 (X,X,X)-G 3 (X,X,X)\\ F 

< \\G 3 (X,X,X) - G 3 (X,X,X )II + II G 3 (X,X,X) - G 3 (X,X,X)H 

< 3||G 3 (A x ,A,A)|| +3||G 3 (A x ,A x ,A)|| + ||G 3 (A X , A x , A x )|| + «5 3 ||A|| 3 


<7||G 3 ||||A|| 2 ||A X || + 

/IgsjU 1 


mi + ||A x ||) 3 <5 3 


< o 




^ 2.5 
I min 


S 2 + 


^ 1-5 
I mm 


S 3 


2.5 1.5 

Note that by the assumption 6 2 < o( , ipi”n ), S 3 < o( mi ' 


apply Theorem 


D.2 


L ), we have e < o(^). Therefore we can 
(over the randomness of the 
randomized algorithm itself), the tensor power algorithm runs in time poly(n, k, 1 /X m in) and for 
some permutation n over [A;] it returns: 


to conclude that with probability at least 1 — e~ 


r(») 


- v, 




8e 


At, 


|Aj — Aj| < 5e, Vj €E [A;]. 


Finally, since we also have 5e < 1/2 < A m j n /2 we can bound the estimation error of a* and tDj 


as defined in (32) by: 


|a,r(i) - Oj|| < 3(|| AxIIA r 


+ 


1 8e 
TcUUy 

I min A rnm 


A, 


+ 


1 


% 


0.5 


-5e) 


< poly(||G 3 ||, l/cr m i n (G 2 ), 1 /u 0 )S 2 + poly(||G 3 ||, l/cr min (G 2 ), 1 /u 0 )S 3 , 
I & - Wi|| < poly(||G 3 ||, l/a min (G 2 ))S 2 + poly(||G 3 ||, l/a min (G 2 )) 6 3 . 


□ 


Now we can apply Theorem D.2 to our case. 


Lemma D.3. Given I 4 , Y§, U and suppose that WY 4 — | Y/ — Yg||p as well as \\U — U\\ are 

bounded by some inverse poly(n , k, l/w 0 ,1 /p)S. There exists an algorithm that with high probability, 
returns E^ ’s and obi’s such that for some permutation 7 r over [A:], we have the distance ||EW — £(*) || 
and ||wi — Wi|| are bounded by 5. Moreover, the running time of the algorithm is upperbounded by 
poly(n, k, l/w 0 , 1 /p). 


Proof, (to Lemma D.3 


D.2 


We apply Theorem 
l/ominiYi) are polynomials of the relevant parameters. 


and pick G 2 = I 4 , G 3 = Yq. We only need to verify that | l/jj] and 

This is easy to see, since a min(Y^ ) > 

has 


G.15 


w 0 ( 7 min(A) 2 , and the matrix S is a perturbed rectangular matrix which by Lemma 
o- mi „(E) > with high probability. 

Finally, given cr*), and given the output of Step 2 , i.e. U, with inverse polynomial accuracy, we 
can recover EW = U3® up to accuracy polynomial in the relevant parameters. □ 
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Input: Samples Xi from the mixture of Gaussians , number of components k. 
Output: Set of parameters Q = {(wj, EW) : i € [A:]}. 

Estimate M4, M$ using the samples. 

1 N 1 N 

M 4 = Jr x '* <g)4 ’ M 6 = Jf Xi ■ 

2=1 2—1 

Let s = 9 f" \/n] 

(Step 1 (a) Algorithm [ip 

51 = FindColumnSpan(M4, { 1 ,s}), 

52 = FindColunmSpan(M4, {s + 1 , 2 s}). 

(Step 1 (b) Algorithm [2|) 

£ 7 i = FindProjectedSigmaSpan(M4, { 1 ,..., s}, Si), 
t/2 = FindProjectedSigmaSpan(M4, (s + 1 ,..., 2 s}, S2). 

(Step 1 (c) Algorithm [sj) 

[/ = MergeProjections(Si, U\, S2, t/2). 

(Step 2 Algorithm [1]) 

(I4, y 6 ) = Estiinatel4Lfj(A/4, Mg, C/). 

(Step 3 Algorithm [s]) 

C/ = TensorDecomp(l4, Ye,U) 

Return: Q. 

Algorithm 6: MainAlgorithm (Zero-mean case) 


E 


Proofs of Theorem 


3.5 


The results in all previous sections showed the correctness and robustness of each individual step 
for the algorithm for zero-mean case, In this section, we summarize those results to prove that the 
overall algorithm has polynomial time/sample complexity. 

Lemma E.l (Concentration of empirical moments). Given N samples xi,, xjv drawn i.i.d. from 
the n-dimensional mixture of k Gaussians, if N > n‘/5 2 , then with high probability, we have that 
for all ji,, je <E [n\: 


[M 4 ] 


n J3 ,J3 J4 


- [M 4 ] 




<S, 


[Me] 


Jl,33,33 J5,J6 


-[Me] 


Jl J3 J3 J4 J5 J6 


< 5. 


Proof. Let x denote the random vector of this mixture of Gaussians. We first truncate its tail 
probabilities to make all the entries {\x\j for j E [n]) in the vector x be in the range [— y/n, \/n\. 
Apply union bound, we know that with high probability (at least 1 — 0(e -n )), for all indices 


ji, ■ ■ ■, je e M, we have [x\ h ... [x] j6 
the empirical moments by: 


< ?r 3 . Then we can apply Hoeffding’s inequality to bound 


Pr 


|E[x. 


31 


■ Xi, 


36. 


— E[x 


31 


. X 


J6JI ^ ^ 


2 5 2 N 2 

^ exp( -iw )+0(0 - 0 ( 0 - 


□ 


Proof, (of Theorem 3.5 ) 
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We show that, to achieve e accuracy in the output of Step 3 in the algorithm for the zero- 
mean case, the number of samples we need to estimate the moments M 4 and M§ is bounded by a 
polynomial of relevant parameters, namely poly(n, k, l/u 0 , 1/e, 1/p), and each step of the algorithm 
can be done in polynomial time. 

We backtrack the input-output relations from Step 3 to Step 2 and to Step 1, and we show 
that the estimation error in the empirical moments and the inputs / outputs only polynomially 
propagate throughout the steps. 

Q 

First note that we have shown that every steps fails with negligible probability ( 0(e~ n ) for any 
absolute constant C). Then apply union bound, we have that the entire algorithm works correctly 
with high probability. 


1. By Lemma D.3, in order to achieve e accuracy in the final estimation of the mixing weights 
and the covariance matrices, we need to drive the input accuracy of Step 3 (also the output 
accuracy of Step 2) to be bounded by some inverse polynomial in (?z, 1/e, l/p>, l/cu 0 ), Also 
recall that this step has running time poly(n, k, 1/p, l/u; 0 ). 


2. Theorem 


C.l 


and Theorem 


C.10 


guarantee that with smoothed analysis a m in(H 4 ) and a m in(H%) 
are lower bounded polynomially. Then by Lemma [C.16| in order to have the output accuracy 
of Step 2 be bounded by inverse poly(n, 1/e, 1/p, l/w 0 ), we need to drive the input accuracy 
of Step 2 (U, M 4 ) to be bounded by some other inverse polynomial. Step 2 involves solving 
linear systems of dimension 77 . 4^2 and n^ks, thus it running time is polynomial. 


3. Lemma B.13 and B.14 guaran tees t hat with smoothed analysis <Jk(Qu ) is lower bounded 

in order to have the output accuracy of Step 1 (c) ( JJ ) 


B.15 


polynomially. Then by Lemma 
be bounded by inverse polynomial, we need to drive the input accuracy (output S) of Step 
1 (a) and output U t of Step 1 (b) ) to be bounded by some other inverse polynomial. Step 
1 (c) involves multiplications and factorization of matrices of polynomial size, and thus the 
running time is also polynomial. 


4. Lemma B.7 guarantees that with smoothed analysis <Jk(Qu s ) is lower bounded polynomi- 

in order to have the output accuracy of Step 1 (b) (Us) be 


B.10 


ally. Then by Lemma 

bounded by inverse polynomial, we need to drive the input accuracy (output Si of Step 1 (a) 
) to be bounded by some other inverse polynomial. Step 1 (b) involves multiplications and 
factorization of matrices of polynomial size, and thus the running time is also polynomial. 


5. Lemma 


B.l 


guarantees that with smoothed analysis (Jk(Qs) is lower bounded by inverse 


polynomial. Then by Lemma B.3, in order to have the output accuracy of Step 1 (a) ( S ) be 


bounded by inverse polynomial, we need to drive the input accuracy (the moment estimation 
M 4 ) to be bounded by some other inverse polynomial. Step 1 (a) involves multiplications and 
factorization of matrices of polynomial size, and thus the running time is also polynomial. 


6. Finally, by Lemma 


E.l 


bounded by inverse po 
relevant parameters, including k 


in order to have the accuracy of moment estimation ( M 4 , Mg) be 
ynomial, we need the number of samples N polynomial in all the 


□ 


F General Case 

In this section, we present the algorithm for learning mixture of Gaussians with general means. 
The algorithm generalizes the insights obtained from the algorithm for the zero-mean case. The 
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steps are very similar, and we will highlight the differences. 


Input: Samples {xi G M n : i = 1,..., N} from the mixture of Gaussians, number of 
components k. 

Output: Set of parameters Q = {(uji, p^\ EW) : i G [A:]}. 

Estimate M 3 M 4 , Mq using the samples 

1 N 1 N 1 N 
M 3 = — ^ x*0 3 , M 4 = — ^2 Xj< 8 ) 4 , Me = — ^2 X6 ® 3 

i= 1 i=l i =1 

Step 1 (a). (This can be accomplished similar to Algorithm^ FindColumnSpan) 

Let Hi = {1,..., 12 y/n}, find Si = span{/fW, E^ : i G [fc], j G LG}. 

Let LA 2 = {12i/n + 1,..., 24-^/n}, find £2 = span {p^\ s|*\ : i G [fe], j G L^}- 

Step 1 (b) (This can be accomplished similar to Algorithm^ FindProjectedSigmaSpan) 
Find Li = span{Proj s xEW : i g [A:]}. 

Find U 2 = span{ Proj^xE^ : i G [A:]}. 

Step 1 (c) (This can be accomplished similar to Algorithm^ MergeProjections) 

Merge Ui and U 2 to get Z = span{p^ : i G [A;]}, 

U' = span {vec( Pro j z x E W) : i G [A:]}, and U Q = span{Proj z xEWProj z x : i G [*]}■ 

Step 2 

Project the samples to the subspace Z^\ Pro] z ±x = {Proj^xXi,..., Proj^xXAr}. 

Apply the algorithm for zero mean case to the projected samples, 

let G 0 = {(wi,Proj z x£WProj z x) : % g [A:]} = MainAlgorithm (Zero-mean case)(Proj^xa;). 
Step 3 

Let T = [vec(Proj z xE«Proj z x) : i G [A;]] fT G R n2xk , 
and let T® for i G [A;] denote the columns of T. 

Let M 3 ( 4 ) G M nxn be the matricization of M 3 along the first dimension. 

Let /zW = M s ^T^/ui for i G [A:] and let p = [p^ : i G [A;]]. 

Step 4 

Let M 4 = M 4 + 2 E - =1 w*M (i) ® 4 - 

Find the span 5 = span{vec(EW) + p^ 0 fog [A;]}. 

(This can be achieved by treating M\ as the 4-th moments of a mixture of zero-mean 
Gaussians, and apply Step 1 in the algorithm for zero-mean case to find the span of the 
covariance matrices, and let S denote the result.) 

Let E = [vec(EW) : i G [A;]] = (ProjsL 7 — pQ p). 

Return: Q = {(caj, //W, eM) : i G [A:]}. 

Algorithm 7: MainAlgorithm (General Case) 
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Step 1. Span finding In this step, we find the following two subspaces: 

Z = span{Jl W : i £ [A:]}, E 0 = span{Proj^ x EWProjg ± }. 

This is very similar to Step 1 in the algorithm for the zero-mean case, and can be achieved in 
three small steps: 

1. Step 1 (a). For a subset Fi of size 12 y/n, find the span S of the mean vectors and a subset of 
columns of the covariance matrices: 

S = span{pt W ,s|.' ) /] : i £ [k\,j £ Fi}. 

2. Step 1 (b). Find the span of covariance matrices projected to the subspace S 1 -: 

Us = span{Proj 5 xX^ : i G [A]}. 

3. Step 1 (c). Run 1(a) and 1(b) on two disjoint subsets Fix and Fi- 2 - Merge the two spans U\ 
and U2 to get Z and span{ Proj^ x FW : * G [A]}. 

Next, we discuss each small step and compare it with the similar analysis of the algorithm for 
the zero-mean case. 


Step 1 (a). Find the span S of the means and a subset of the columns of the covariance 
matrices Similar to Step 1 (a) for the zero-mean case, in this step we want to find a subspace 
S which contains the span of a subset of columns of S^’s. However, with the mean vector pt^’s 
appearing in the moments, the subspace we find also contains the span of all the mean vectors. In 
particular, for a subset Fi G [n] with \Fi\ = y/n, we aim to find the following subspace: 


S = span{p^\ 


[:J] : * £ [k],3 £ 


(33) 


Similar to Claim 5.1 for the zero-mean case, the key observation for finding the subspace is the 
structure of the one-dimensional slices of the 4-th order moments for the general case: 


Claim F.l. For any indices ji, j 2 , jz £ [n], the one-dimensional slices of M 4 are given by: 


n 

M4( eil ,e i2 ,e j3 ,I) = + 

i= 1 

7rG 


E 

0'ld2j3), j 
U2J3J1), > 

(j3 J1J2) J 


v(d vM + TiWrrWy’M 


_l_ 

1 ^ 7 ri ' 


TTl n-K2 [:,TC 3 \ ' ^70^2^3 


uWnio 

/Vo h 1 


(34) 


Note that if we pick the indices j \, . 72 , j .3 £ 77, all such one-dimensional slice of M 4 lie in the 
subspace S. We again evenly partition the set Fi into three disjoint subset Fi^ and take ji £ Fi^ 
for i = 1 , 2 ,3. Define the matrix Q$ £ as in ( 12 ) whose columns are the one-dimensional 

slices of M/p. 


Qs = 


[[AT 


4 \ e ji i e j 2 ) e 33 i 


I) ■ js £ Fi (3) ] : J2 £ Fi (2) ] : jj £ ? 7 (1) 


f nx(|W|/3) 3 


(35) 


The proof of this step is similar to the Lemmas B.l|(for smoothed analysis) and B.3 (for stability 


analysis). The main difference is that in the matrix B defined in the structural Claim 


B.2 


there is 
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now another block B with k columns that corresponds to the fi® directions, which we can again 
handle with Lemma lG. 121 


Lemma F.2 shows the deterministic conditions for Step 1 (a) to correctly identify the subspace 
S from the columns of Qs, and uses smoothed analysis to show that the conditions hold with high 
probability. 

Lemma F.2 (Correctness). Given M 4 of a general mixture of Gaussians , for any subset Tt G [n] 


and \TL\ = 02 k with the constant C 2 > 9, let Qs be the matrix defined as in (35). The columns 


of Qs give the desired span S defined in (33) if the matrix Qs achieves the maximal column rank 


k + k\TL\. With probability (over the p-perturbation) at least 1 — Ce 05n for some constant C, the 
k( 1 + | Ti\)-th singular value of Qs is bounded below by: 

a k{i+\H\)(Qs) — P e V™- 

We construct a basis Ps G jpx(fc+fc|ft|) for the 


The proof idea is similar to that of Lemma 
subspace S as follows. 


B.l 


Ps = 


[p {i) ■ i € [k]\ , [[Sgj : * G [A:]] : j G ««] :l = L 2, 3 


P, 




(36) 


Note that the dimension of the subspace S is at most k{\TL\ + 1) < nj 3. Then we show by the 
Claim about the moment structure that the matrix Qs can be written as a product of Ps and some 
coefficient matrix Bs■ Then we bound the smallest singular value of the two matrices Ps and Bs 
via smoothed analysis separately. The coefficient matrix Bs is slightly different than that in the 
zero-mean case, but has similar block-diagonal structure properties. 

The detailed proof is provided below. 


Proof, (of Proposition F.2 ) 


Similar to structural property in Claim 
in a product form: 


B.2 


for the zero-mean case, we can write the matrix Qs 


Qs = Ps {Du ®kr I\n\) {Bs) T . 

We will bound the smallest singular value for each of the factor, and apply union bound to conclude 
the lower bound of 0 fc(i+i%|)(Qs)- 

The matrix Ps G R nx ( fe + fe l?d) is defined in (36). Restricting to the rows corresponding to [n]\H, 
we can use Lemma 


G.16 


to argue that <r fc ( 1+ > epy/n with probability at least 1 — (Ct 


\0.25 n 


In order to lower bound cr m i n (Bs), we first analyze the structure of this coefficient matrix. The 
matrix Bs has the following block structure: 


B s = 


r ( 0 ) ,r ( 1 ) ,r ( 2 ) ,r (3) 


The first block B^ G R 6 W|/ 3 ) 3 xfc j s a summation of four matrices b{ (>> for i = 0,1,2,3, where 
Rq 0) = p-u(3) © Rfo 2 ) 0 Py^(i), and r| 0) = E W ( 3 ) ^( 2 ) © With some fixed and known row 

permutation 7 r( 2 ^ and 7 r( 3 ), the other two matrix blocks B^ and B^ are equal to S W ( 3 ) W (i) © Pu ( ' 2 ') 
and S^( 2 ) n(i) 0 ( 1 ^ 0 ), separately. 

The block B^ G M(I w I/ 3 ) 3 xA; I w I / 3 is block diagonal with the identical block S^( 3 ) n ( 2 ) + pt^( 3 ) © 
p-pC 2 )- Similarly, with the row permutation n^ 2 \ the other two matrix blocks B^ 2 \B^ are 
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equal to the block diagonal matrices with the identical block (S^( 3 ) ^<i) + p^z) © /i-^n)) and 
(S h( 2 )^(i) +m H ( 2 ) ©M h c i)) respectively. 

Note that we can write the block as: 


— (/X^( 3) 0 /i^(2) + S W (3)^(2)) © £t^(l) + 1 (/r^(3) © /%(1) + S^(3)^(l)) 0 , 

"I - f 7T ' 1 C^) C*) — 2 Ihn i(X\ C*) U,ni( 0 .\ (*) ILn i (1 . 


where it is easy to see the first summand (jl^z) ©/%( 2 ) + £-^( 3 ) ^( 2 )) ©/%(i) is a linear combination 
of the columns of the block diagonal matrix TL 1 ), and similarly the second and third summands 
are linear combinations of the columns of B and B®, and the last summand is simply —2Bq°\ 
Therefore for some absolute constant C (the smallest singular value corresponding to the linear 
transformation) we have that: 


d r 


i(B s ) > Cd min {\B^\B^\B^\B^]) 


Note that B^ = p^yz) © p^w © p^P) only depends on the randomness over the mean vectors. 
Note that the Khatri-Rao product is a submatrix of the Kronecker product, therefore for tall 
matrices Qi and Q 2l we have that a min (Q 1 © Q 2 ) < <J m in(Qi ®kr Q 2 ) = cr min (Qi)a min (Q 2 ). In 
particular, we can bound the smallest singular value of B^ with high probability (at least 1 — Ce°' 5n ) 
as follows: 

o-fc(^ 0) ) > cr k (p n(3) )a k (p n( 2))a k (p nW ) > (peVnf- 

Then condition on the value of the means, we further exploit the randomness over the covariance 
matrices to lower bound a k ^ ^P roj ~( 0 )x [B^, B^ 2 \ B 1 ^ 1 ]^. It is almost the same as the argument 
of the proof for Proposition |B.1| For example, compared to (18) we have the following inequality 
instead: 


Ok ^ p roj([ 5 ( 0 )j( 2 ) i s( 3 )] { . }xw( 2 )x ^ C 3 ) )rProj (s ^ { 2 )iW( 3 )+?w( 2 ) 0 ^ ( 3 )) -l(S W ( 2 ))W ( 3 ) +P H ( 2 ) Qp n (z))J > epy/fi, 
and note that any block in B^ is independent of the randomness of covariance matrices, and we 


have (|B|/3 ) 2 — k — 2k\'H\/3 > 2 k. Simil ar mo difications apply to the inequalities in (20),(21). 
Finally by the argument of Lemma 


G.12 


we can bound a m i n (Bs) with probability at least 


1 - Ce°- 5n (over the randomness of both the perturbed means and covariance matrices): 

Vmin(Bs) > min{(pey / n) 3 , ep\/n} = ep^/n, 
as we assume p to be small perturbation and pe^/n < 1 . 




Step 1 (b). Find the projected span of covariance matrices Given the subspace S = 
span{p^\ : i 6 [A:]} obtained from Step 1 (a), Step 1(b) finds the span of the covariance 

matrices with the columns projected to S 1 -, namely: 

Us = span{Proj 5 x£^ : i € [fc]}. 

This is in parallel with Step 1 (b) for the zero-mean case, and we rely on the structure of the two- 


for the zero-mean case, the following claim shows how the structure of the two-dimensional slices 
is related to the desired span. 


dimensional slices of M 4 to find the span of the projected covariance matrices. Similar to Claim 


B .6 
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Claim F.3. For a mixture of general Gaussians, the two-dimensional slices of are given by: 
M 4 (e h ,e j2 ,I,I) =E^(( £ 5L +^?(4 ) ) T )( S (i) +M W (M W ) T ) 


i =1 


+ p T + S« 2] (/I«) t ) + ) T + e[5 i] (/Z (0 ) T 


■Jl 

■'(*) 


[: J'a]' 
i(0 \T 


Note that given the set of indices Fi we chose in Step 1 (a) and the subspace S, if we pick the 
indices ji,j 2 £ Fi, project the two-dimensional slice to S^, all the rank one terms in the sum are 
eliminated and the projected slice lies in the desired span Us : 

k 

Proj s ±FU(e jl ,e h ,I,I) = +J i j?(Pj2) T ) Pl0 3s ± ^ t \ Vjuh € Fi. 

i— 1 


(*) \T 


Applying the same argument as in Lemma B.7 for the zero-mean case, we can show that with 
high probability over the perturbation, all the projected slices span the subspace Us- 


Step 1 (c). Merge the two projections of covariance matrices Pick two disjoint index set 
FL\ and FL 2 and repeat the previous two steps 1 (a) and 1 (b), we can obtain the two spans U\ and 
U 2 , corresponding to the subspace of the covariance matrices projected to 5i and S 2 , respectively. 

In this step, we apply similar techniques as in Step 1 (c) for the zero-mean case to merge the 
two spans U\ and I/ 2 : we first use the overlapping part of the two projections Proj 5 ± and Proj 5 x 
to align the basis of U\ and U 2 , then merge the two spans using the same basis. 

Note that for the general case, by definition the span of the mean vectors Z lie in both subspaces 
and S 2 , therefore we have cSj 1 C Z 1 - and C Z 1 -. We can show that cq 1 U 5^ = Z 1 - by lower 
bounding <7 n _fc([Proj 5 ±,Proj 5 ±]) with high probability, similar to that in (28). This gives us the 

span of the mean vectors Z. 

Moreover, in the general case, from merging U\ and U 2 we are only able to find th e spa n of 
covariance matrices projected to the subspace Z^. In particular, we can follow Lemma 


B.ll 


and 


Lemma B.15 in Step 1 (c) for the zero-mean case to show that for the general case, we can merge 
U\ and U 2 to obtain the span span{Projj? x I]W : i £ [&]. By further projecting the span to Z 1 - from 
the right side, we can also obtain = .s/jan{Proj^ ± eT' Proj : i £ [&]}. 


Step 2. Find the covariance matrices in the subspace orthogonal to the means Given 
the subspace Z and E 0 = span{ ProjProj r? ± : i E [k]} obtained from Step 1, Step 2 applies 
the zero-mean case algorithm to find the covariance matrices projected to the subspace , i.e., 
Proj^sWProj^’s, as well as find the mixing weights ufs. 

This follows the same arguments as in Step 2 and Step 3 for the zero mean case. Consider 
projecting all the samples to Z^, the subspace orthogonal to all the means. In this subspace, 
the samples are like from a mixture of zero-mean Gaussians with the projected covariance matri¬ 
ces, and the 4-th and 6-th order moment are given by IW 4 (Proj^ ± , Proj^j_, ProjProj^j_) and 
Mg(Proj^j_, Proj^ x , Proj^ ± , Proj^ x , Proj^, Proj^ x ). Since Z is of dimension k, the dimension of 
the zero-mean Gaussian in the projected space is at least n — k = 0(n). 

Note that the subspace Z only depends on the randomness of the means, and random per¬ 
turbation on the covariance matrices is independent of that of fi. The smoothed analysis for the 
moment unfolding in Step 2 and tensor decomposition in Step 3 for the zero-mean case, which only 
depend on the randomness of the covariance matrices, still go through in the projected space. 
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Step 3. Find the means This step finds the mean vectors based on the outputs of the previous 
steps. The key observation for this step is about the structure of the 3-rd order moments in the 
following claim: 


Claim F.4. Let the matrix M 3 n) E 
The j-th row of M-uy is given by: 


be the matricization of M 3 along the first dimension. 


= \^[ x 3 x h x h\ '■ h e N] : h G N 


Wi (jtfveci E«) + /ZfjuW © + s{5] 0 + £ (i) © sg.j 


2—1 


T 


(37) 


The following lemma shows how to extract the means /x®’s from M 3 ( 3 ) using the information of 
the covariance matrices projected to the subspace orthogonal to the means, i.e. X 0 , and the mixing 
weights cjj’s. 

Lemma F.5. Given the mixing weights u>i's and the projected covariances So ’s, define the matrix 


T G 


prvxfc 


to be the pseudo-inverse ofT, 0 : 


T = 


vec(Y$) : i E [/c] 


ifT 


The mean //W of the i-th component can be obtained by: 


M (,) = lM m Th 

UJi 


This step correctly finds the means if the X 0 is full rank with good condition number, and this holds 
with high probability over the perturbation. 


Proof, (of Lemma F.5 


The basic idea is that since £ 0 lies in the span of P = Proj^x ©*, r Proj^ X) and the last three 
summands in the parenthesis in (37) all lie in span{I n ®kr Projg, Projg ©fc r /„} = span{P - 1 }. 
Therefore hitting the matrix M 3 n) with xj, from the right will eliminate those summands and pull 
out only the mean vectors. 

Recall that the columns of the matrix X Q are vec(Proj^±Sd)Proj^x) = Pvec(S®)’s, and the 
columns of X are vec(Sd))’s. 

Note that T = (PX)t T = PP,^ T , and the columns of T lie in span{P}. Also note that for all 
i,j G [k] the vectors //W © juW, X^ © and /fd) © x|*^ all lie in the subspace span{I n ®kr 
Proj^, Proj©fc r I n } = span{P - 1 }. Therefore these terms will be eliminated if we multiply 
the columns of T to the right of M-yy. For the first term //^vec(xW), since vec(S^)) T Tj. j] = 

(Pvec(S^7)) T lj. = 1 [i=j]- Therefore, we have . 

The smoothed analysis for the correctness of this step is easy. We only need to show that both 
X Q and X robustly have full column rank with high probability over perturbation o f the covariance 


matrices, and thus the pseudo-inverse T is well defined. This follows from Lemma G.15 


Finally, the stability analysis for this step is also straightforward using the perturbation bound 
for pseudo-inverse in Theorem |G.7[ O 
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Step 4. Find the unprojected covariance matrices Note that by definition Z = span{Jl W : 
i G [A:]}, the projected covariance Proj^ x (^ 1 ) we obtained in Step 2 is also equal to Proj^ x (EW + 
//(*)(//(*)) T ). In Step 4 we try to recover the missing part of the covariance matrices in the subspace 
Z. Note that since we have also obtained the means in Step 3, it is equivalent to finding (SW + 
/x( l )(//(*)) T ) for all i. We will show that if we can find the span{(T^ +//W(//W) T ) : i G [fc]}, the 
projected vector Proj^ x (SW -f- jlM (££(*) ) T ) can be used as anchor to pin down the unprojected 
vector. 

They key observation for finding the span of span{(Y >W + : i G [A;]} is to first 

construct a 4-th order tensor M’ A which corresponds to the 4-th moment of a mixture of zero-mean 
Gaussians with covariance matrices (IW and then follow Step 1 in the algorithm for 

zero-mean case to find the span of the covariance matrices for this new mixture of Gaussians. 

The next lemma shows how to construct such 4-th order tensor: 

Lemma F. 6 . Given the 4-th moment IW 4 for a mixture of Gaussians with parameters {£ 4 , EW} ; 
define the f-th order tensor to be: 

k 

M' =M 4 + 2^w^ (0 ® 4 , 

2=1 

then M 4 is equal to the 4-th moment of a mixture Gaussians with parameters {u5j, 0, £W+juW(//W) T }. 

The proof follows directly from Isserlis’ Theorem. Therefore we can repeat Step 1 in the zero- 
mean case here to find the span of the space {vec(Sb-)) 4 - Jj ,W 0 jf l ) ■ j ^ [A:]}. Since we also know 
the projection of E^’s in a large subspace (in the subspace Projsq 0 fc r Proj^ x obtained from Step 
2 ), we can easily recover En^’s: 

Lemma F.7. For any matrix U G M dxfc and any subspace P, given P T U and the span S of columns 
of U, the matrix U can be computed as 

U = S(P T S^(P T U). 

Further, this procedure is stable if crmin(P T S) is lower bounded. 

Proof. This is a special case of the Step 1 (c) where we merge two projections of an unknown 
subspace. 

The span S is equal to UV for some unknown matrix V. We can compute V = (P T U)^P T S , 
and hence U = SIW 1 = S(P T S)^(P T U). The stability analysis is similar (and simpler than) 
Lemma ID. Ill □ 

We will apply this lemma to where the subspace P is Proj^ x < 8 >fc r Proj ^ x . Since the perturbation 
of the means and the covariance matrices are independent, we can lower bound the smallest singular 
value of P T S. 


F.l Proof Sketch of the Main Theorem 13.41 


The proof follows the same strategy as Theorem 3.5 First we apply the union bound to all the 


smoothed analysis lemmas, this will ensure the matrices we are inverting all have good condition 
number, and the whole algorithm is robust to noise. 

Then in order to get the desired accuracy e, we need to guarantee inverse polynomial accuracy in 
different steps (through the stability lemmas). The flow of the algorithm is illustrated in Figure [5} 
In the end all the requirements becomes a inverse polynomial accuracy requirement on IW 4 and Mq, 


which we obtain by Lemma E.l 
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Figure 5: Flow of the algorithm for the general case 


G Matrix Perturbation, Concentration Bounds and Auxiliary Lem¬ 
mas 

In this section we collect known results on matrix perturbation and concentration bounds. In 
general, matrix perturbation bounds are the key for the perturbation lemmas, and concentration 
bounds are crucial for the smoothed analysis lemmas. We also prove some corollaries of known 
results that are very useful in our settings. 


G.l Matrix Perturbation Bounds 


Given a matrix A = A + E where if is a small perturbation, how does the singular values and 
singular vectors of A change? This is a well-studied problem and many results can be found in 
Stewart and Sun Stewart (1977). Here we review some results used in this paper, and prove some 
corollaries. 

Given A = A + E, the perturbation in individual singular values can be bounded by Weyl’s 
theorem: 

Theorem G.l (Weyl’s theorem). Given A = A+E, we know (Jk{A) — \\E || < <Jk{A) < Uk(A) + \\E\\. 

We can also bound the ^ 2 norm change in singular values by Mirsky’s Theorem. 

Lemma G.2 (Mirsky’s theorem). Given matrices A, E £ M mxn with m > n, then 


+ E) - ai(A)Y <\\E\\ f . 

\ i =1 


For singular vectors, the perturbation is bounded by Wedin’s Theorem: 


Lemma G.3 (Wedin’s theorem; Theorem 4.1, p.260 in Stewart and Sun (1990)). Given matrices 
A,Eg W nxn with m>n. Let A have the singular value decomposition 


A = [U l ,U 2 , U-i] 


Si 

0 

0 


0 

S 2 

0 


[Pi,b>r 


Let A = A + E, with analogous singular value decomposition. Let be the matrix of canonical 
angles between the column span of U\ and that of U\, and © be the matrix of canonical angles 
between the column span of V\ and that of V\. Suppose that there exists a 5 such that 

min |[Lab j - [E 2 ],j| > 6 , and min |[Iqb j| > S, 
i,j i,i 
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then 


sin($)|| 2 + || sin(0)|| 2 < 2 


\E\\ 


6 2 


We do not go into the definition of canonical angles here. The only way we will be using this 
lemma is by combining it with the following: 


Lemma G.4 (Theorem 4.5, p.92 in Stewart and Sun (1990)). Let $ be the matrix of canonical 
angles between the column span of U and that of U, then 


II Projfj - Projjj || = || sin$||. 

As a corollary, we have: 

Lemma G.5. Given matrices A,E £ M mxn with m > n. Suppose that the A has rank k and the 
smallest singular value is given by cpA). Let S and S be the subspaces spanned by the first k 
eigenvectors of A and A = A + E, respectively. Then we have: 

115- S\\ < \\Proj § - Proj s \\ = || Proj §± - Proj s ±\\ < ■ 

Moreover, if \\E\\p < ak(A)/\/2 we have ||5 — 5|| < 

Proof. We first prove the first inequality: 

||Proj i? -Proj iS || = ||25(5-5) T + (5-5)(5-5) T ||>2||5||||5-5||-||5-5|| 2 >||5||||5-5|| = ||^-5||. 

The equality is because Projgx = / — Proj 5 so the two differences are the same. The final step 
follows from Wedin’s Theorem and Lemma I G. 41 □ 


Often we need to bound the perturbation of a product of perturbed matrices, where we apply 
the following lemma: 

Lemma G.6. Consider a product of matrices A\ ■ ■ ■ A^, and consider any sub-multiplicative norm 
on matrix || • ||. Given A\,..., Aj~ and assume that \\Ai — Ai\\ < ||Aj||, then we have: 

ll£ ■■■A k -A l ---A k || < 2 fc_1 H ll^ll £ 

i=l i=1 " *H 

The proof of this lemma is straightforward by induction. 


Perturbation bound for pseudo-inverse When we have a lowerbound on cf rmn (A), it is easy 
to get bounds for the perturbation of pseudoinverse. 


Theorem G.7 (Theorem 3.4 in Stewart (1977)). Consider the perturbation of a matrix A G 
B = A + E. Assume that rank(A) = rank(B) = n, then 


||St_ylt||<V2pt||||St|||| E 


As a corollary, we often use: 


Lemma G.8. Consider the perturbation of a matrix A £ M mxn : B = A + E where p|| < 
Pmin{A)/2. Assume that rank(A) = rank(B) = n, then 

pt —A f || <2V2\\E\\/a min (Af. 


Proof. We first apply Theorem |G.7 
P'1'11 = 1/ cr m i n {A). By Weyl’s theorem a, 
Omin (B)- 1 < 2a rnin (A)~ 1 . 


and then bound 

( B ) > a„ 


pt|| and ||P||. 
in(A) - p|| > a 


By definition we know 
mini A) / 2, hence ||P|| = 

□ 
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G.2 Lowerbounding the Smallest Singular Value 

Gershgorin’s Disk Theorem is very useful in bounding the singular values. 

Theorem G.9 (Gershgorin’s theorem). Given a symmetric matrix X G M fcxfc , a lower bound on 
the smallest eigenvalue is given by: 

®min (X) > min { X hl - V X id 

JTT-,- 

Sometimes, it is easier to consider the projection of a matrix. Lowerbounding the smallest 
singular value of a projection will imply the same lowerbound on the original matrix: 

Lemma G.10. Suppose A G R mxn , let P G M mXfi be a subspace, then crk(P T A) < ak(A). 

Proof. Observe that (P T A) T (P T A) = A T (PP T )A A A T A (because P is a subspace). Therefore 
the eigenvalues of (P T A) T (P T A) must be dominated by the eigenvalues of A T A. Then the lemma 
follows from the definition of singular values. □ 

As a corollary we have the following lemma: 

Lemma G.ll. Let A G M mxn anc ^ SU pp 0se that m > n. For any projection Proj s , we have that 
the singular values are non-increasing after the projection: 

ai(Proj s (A)) < <Ti(A), fori = 1,..., n. 

In several places of this work we want to bound the singular value of a matrix, where part of 
the matrix has a block structure. 

Lemma G.12. For given matrices G M mxn and C*W G W nxn ' for i = 1 Suppose 

md > (n + n'd), Define the tall matrix A G ^ mdx ( n + dn ') : 

- s(i) 0 • • • O' 

5(2) 0 C (2 ) ••• 0 r ... i 

A = . . . B,diag{C «) . 

_ B^ 0 0 • • • C (rf) _ 

The smallest singular value is bounded by: 

V(n+d,n')( A ) > min{cr n (B), cr n >(Proj^ BW) xC^) :i = l,...,d}. 

Proof. The idea is to break the matrix into two parts A = Proj s A + Proj s ±A.Since these two 
spaces are orthogonal we know <J( n +dri)( A ) ^ min{iT n (Proj B A), ad n /(Proj B ± A)}. 

For the first part, clearly <r n (Proj B A) > a n (B), as B is a submatrix of Proj^A. 

For the second part, we actually do the projection to a smaller subspace: for each block we 
project to the orthogonal subspace of B^ l \ Under this projection, the block structure is preserved. 
The dn '-th singular value must be at least the minimum of the n '-th singular value of the blocks. 
In summary we have: 

°{n+dn')( A ) >min{<T n (S), o- dn /(Proj jB _Ldiag(C' W ))} 

> min{<7 n (i?), cr dri /(Proj diag((B ( i ) )i) diag(C' ( * ) ))} 

> min{<T n (5), cr (in /(diag(Proj (B ( i ))xC ,(i) ))} 

> min {a n (B), a n /(Proj (B (i))±C W ) :i = l,...,d}. 

□ 
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Smallest singular value of random matrices In our analysis, we often also want to bound 
the smallest singular value of a matrix whose entries are Gaussian random variables. Our analysis 
mostly builds on the following results in random matrix theory. 

For a random rectangular matrix, Rudelson and Vershynin (2009) gives the following nice result: 

Lemma G.13 (Theorem 1.1 in Rudelson and Vershynin (2009)). Let A E M mxn a nd suppose that 
m , > n. Assume that the entries of A are independent standard Gaussian variable, then for every 
e > 0, with probability at least 1 — ( Ce) m ~ n+1 + e~ c ' n , where C, O' are two absolute constants, we 
have: 


(r n (A) > e(y/m — \Jn— 1). 

We will mostly use an immediate corollary of the above lemma with slightly simpler form: 

Corollary G.14. Let A E M mxn and suppose that m > 2 n. Assume that the entries of A are 

independent standard Gaussian variable, then for every e > 0, and for some absolute constant C, 
with probability at least 1 — ( Ce)°' 5m , we have: 

a n (A) > e\fm. 

This lemma can also be applied to a projection of a Gaussian matrix: 

Lemma G.15. Given a Gaussian random matrix E E M mxn , for some set J E [m] define Ej = 

[E[j,:] ■ j £ J] and Ejc = [Ey^ : j E [m\/J}. Define matrix S E M nxr whose columns are 

orthonormal. Suppose that the matrix S is an arbitrary function of Ej and is independent of Ejc. 
Assume that 


m — \ J\ — r > 2n (38) 

Then for any e > 0, we have that with probability at least 1 — (C'e) 0 ' 5 ( m ~l^~ r ) ; for some absolute 
constant C, the smallest singular value of the projected random matrix is bounded by: 

a n (Proj s ±E ) > e^/m- \ J\ - r. (39) 

Proof. For a matrix A E M mxri , define the fixed matrix Pjc E ]R( m- l^l) xm such that: 

[[ P J c ][:,j] ■ J £ J] = °, : 3 € MA 7] = I(m-\J\)x(m-\J\), 

which only keeps the coordinates that correspond to [m\/ J of any vector in M m . Note that 

a n (Proj s ±E) > a n (Pjc(Proj s ±E)) 

> a n (Proi {PjcS) ±PjcProj s ±E) 

= a n (Proi {PjcS) ±PjcE). 

We justify the last equality below. Note that 


Proj s ±E = E - Proj S E, 

and note that the columns of (PjcProjg-E) lie in the column span of PjcS, therefore, 
^oj ( pj cS )±Pj-Proj s ±E = Proj {PjcS) ±PjcE - Proj (Pjcs) ±( Pj<. Proj S E) 

= Pr °j(P JC S)-L PjcE. 

Finally, note that PjcS, with column rank no more than r, is independent of PjcE, which is a 
random Gaussian matrix of size (m— \ ff\)xn, therefore we have that Proj ( Pjr m± Pj< E is equivalent 


to a (m —\Sf\ — r) X n random Gaussian matrix. Since (38) is satisfied, we can apply Lemma G.13 
and conclude (f39| with high probability. □ 
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However, in the smoothed analysis setting, the matrix we are interested in are often not random 
Gaussian matrices. Instead they are fixed matrices perturbed by Gaussian variables. We call these 
“perturbed rectangular matrices”, their singular values can be bounded as follows: 

Lemma G.16 (Perturbed rectangular matrices). Let A E M mxn anc [ suppose that m > 3 n. If all 
the entries of A are independently p-perturbed to yield A, then for any e > 0, with probability at 
least 1 — (Ce)°' 25m , for some absolute constant C, the smallest singular value of A is bounded below 
by: 

<r n (A) > ep\fm. 

Proof. The idea is to use the previous lemma and project to the orthogonal subspace of A. We 
have that A = A + E, where E E M mxn is a random Gaussian matrix. 

<?n(A') > ^(Proj^xH) = a n (Proj A xE). 

Since m — n > 2 n, we can apply Lemma [G.15| to conclude that for any e > 0, 

cr n (Proj j 4 x£') > epy/m, 

with probability at least 1 — < 1 — (Ce)°' 25m . fi 


G.3 Projection of random vectors 


In Step 2, we need to bound the norm of a random vector of the form u 0 v after a projection, 
where u and v are two Gaussian vectors. In order to show this, we apply the result in Vu and 
) which provides a concentration bound of projection of well-behaved (IP-concentrated) 
random vectors. 

First we cite the definition of “IP-concentrated” below: 


Wang (2013 


Definition G.17. A random vector X = (£i, £2, £n) is K-concentrated (where K may depend 

on n) if there are constants C, C > 0 such that for any convex, 1-Lipschitz function f : C n —> M 
and for any t > 0, we have: 

( t 2 

Pr[|F(X) - med{F(X))\ > t] < Cexp ( ~C 
where med(-) denotes the median of a random variable (choose an arbitrary one if there are many). 


Lemma G.18 (Concentration for Random Projections (Lemma 1.2 in Vu and Wang (2013))). Let 
v be a K-concentrated random vector in C n . The entries of v has expected norm 1. Then there 
are constants C, C' > 0 such that the following holds. Let Proj s be a projection to a d-dimensional 
subspace in C n . 


^ v T Proj s v — d > 2 tVd + t 2 ^j < Cexp(-C'j^). 


In order to apply this lemma in our setting, we need to prove the vectors that we are interested 
in is /Gconcentrated: 


Lemma G.19. Conditioned on the high probability event that H-E). j]||, ||E[.j]|| < 2 yjn 2 , the vector 
[[E[ : i ] 0B|.j]]s iS / : s < s 7 ] is 2 y/nj)-concentrated. 
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Proof. For any 1-Lipschitz function F on [[Et.^i 0 : s < s'], we can define a function 

= F([[i?[. j] O E[. j]] S}S r : s < s']) (if i = j then the function G only takes E^.^ as 
the variable). Under the assumption that \\E[. ||, ||.E7[.j]|| < 2y / h2, this new function G is 
Lipschitz. 

Now we extend G to G* when the input ||_E7j- : i ]||, ||£).jj|| > 2y / n2. Define the truncation function 
trunc(n) = v for ||w|| < 2 y/n2, and trunc(u) = 2y / n2n/||n|| for ||u|| > 2^Jnf. Define the extended 
function G*(E[.^, E^.jj) = G(trunc(U[. j]), trunc(£’[. j])), which is still 2 y / n 2 -Lipschitz since the trun¬ 


cation function is 1-Lipschitz. 

Note that for the two Gaussian random vectors Era , E \.u 
concentration bound in Theorem G.20| on G* , which implies 


~ N(0,I), we can apply Gaussian 


p[|g-(e m ,.e m ) 


— med(G*(£’[. j],> t] < Cexp(— C't 2 fdrv}). 


Since the probability of the event ||U[. j]|[, ||F , [ : .j]|| > 2 ^/ft 2 is very small (~ exp(—D(ri 2 ))), we have 
5 = |med(G(Fl[. ) j], Flj.j]))—med(G*(E'[. ) j], Fi[.j]))| in the order of O^^/nf). Therefore, for t ~ 
we have 


n G*(E [lA ,E [:J] ) - med(G(E [:A , E [:J] ))\ > t] < P[| G* (E [:A , E [:J] ) - med(G(E [:ji] , E [:J] ))\ >t-5] 

< Cexp(— C't 2 /dn 2 ). 


Finally, 


G(E[.^, E[.j]) - med(G(Fl[. i j],S[. J -]))| > t ||F’[ :j i]||, ||-E[ : j] || < 2 y/nf 
^ nG*{E VA ,E^) - med(G(E [:ii] ,E bJ] ))\ > t] 

P[||-®[ : ,i]|| > 2 or \\E[. a \\ > 2y/nf\ 

<C exp(— C't 2 /Anz). 


Therefore the random vector [[E).^ 0 : s < s'] is 2 v /h2-concentrated. □ 

Theorem G.20 (Gaussian concentration bound). Let f : M n —> M be a function which is Lipschitz 
with constant 1. Consider a random vector X ~ A/"(0,I n ). For any s > 0 we have 

¥{\f(X)-E[f(X)}\>s)<2e- Cs \ 

for all s > 0 and some absolute constant C > 0. 


G.4 Gaussian Chaoses 

In Step 2, we want to show that the inner product of two random vectors of the form < Proj(u © 
n),Proj(u On) > is small, where u, v! and v. v' are Gaussian vectors. In order to show this, we 
treat the inner product as a (homogeneous) Gaussian chaos, which is defined to be a homogeneous 
polynomial over Gaussian random variables]^] Our analysis builds on the results of many works 
studying the concentration bound of Gaussian chaoses. 

For decoupled Gaussian chaoses, we mostly use the following theorem, which is a simple corollary 
of Lemma IG.22I 

10 In fact, the squared norm of projected random vectors considered previously is a special case of Gaussian chaos, 
and we treat it separately. 
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Theorem G.21. Suppose a = is a d-indexed array, and ||a||^ denotes its 

Frobenius norm. Let (^Cp))i<i< n j=i....,<i be independent copies of X ~ J\f(0,I n ). For any fixed 
e > 0, with probability at least 1 — Cexp (—C'n 2e ^ d ), 


£ ■ 


v(i) v(^) 
”’ A i d 


< ||a||i^n e . 


Lemma G.22 (Gaussian chaoses concentration (Corollary 1 in Latala et al. (2006))). Suppose a = 
( a ii,...,i d )i<i 1 ,...,i d <n is a d-indexed array. Consider a decoupled Gaussian chaos G = id 

where are independent copies of the standard normal random variable for all i E [n], k E [d]. 


(i) 


(- 


nun 


(IGI > t) < C d exp — — min _ 

\ C d i <k<d (i u ...,i k )es(k,d ) 


\h,—,h 


2/fc> 


vO 


where C d E (0, oo) depends only on d, and S(k, d) denotes a set of all partitions of {1,... ,d} into 
k nonempty disjoint sets I\,... ,Ik, and the norm || • ||/ 1: . ,_j k is given by: 


l a llh,-,4 : = SU P { J2 




(i) 

X- ■ ■ ■ X 


X) . STrJ 1 )^ < i 




£<<: 


■ £o 


S k )\2 


Id 




h. 


< 1 


Proof, (of Theorem |G. 21 ) Apply the inequality: 

ll a ll{i},...,{d} < || a IUi ,...,4 < Il a ll[d] = IMI f, V(/i,... ,4) E S(k, d). 


For a fixed order d and for any e > 0, apply Lemma G.22 and set t = n e ||a||ir. We have that 
P (|G| >t)< Cexp (—C'n 2e / rf ), for some constant C, C'. □ 


For coupled Gaussian chaoses, namely when X^’s are identical copies of the same X, we first 
cite the following decoupling theorem in de la Pena and Montgomery-Smith (1995). 


Theorem G.23. (Decoupling) Let (aj 1 ,..., 4 )i<i 1 ,..., 4 <n be a symmetric d-indexed array such that 
= 0 whenever there exists k / l such that ik = i\. Let X\,...,X n be independent random 
variables and (X^)i <i< n for j = l, dots, d, be independent copies of the sequence (Xf)then 
for all t > 0, 



n 





n 



£ <>(, . iXF 

yM 

id 

> L d t 

< Pr 


a ii,...,id-^-h ' ' ' Xi d 

> L d t 







^1 = l 



< L d Pr 



n 



£ <*». 

> L~ d h 





where L d E (0, oo) depends only on d. 

Essentially this theorem shows for a symmetric tensor with no “diagonal” terms, i.e., = 0 

whenever there exists k ^ l such that 4 = i{), there is only a constant factor difference between 
the coupled and decoupled Gaussian chaos distribution. 

In most of our applications, we do have symmetric tensors with no “diagonal” terms. However 
there is one case where we do have diagonal terms, for which we need the following lemma. 
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Lemma G.24. Let (aq^i 3 )i<i 1 ,...,i 3 <n be a symmetric 3- indexed array and let ||a,||^ denote its 
Frobenius norm. Let X ~ AA(0, I n ), then for any e > 0, with probability at least 1 —Cnexp(—C'n 2e ^ 3 ), 


E 


Ll ,22,23 X-i 1 X-i 2 -^23 


* 1 ,* 2,*3 = 1 


< 4||a|| F n°' 5+<E 


Proof. The sum of the “diagonal” terms is equal to 3 JT / ■ a i,i,jXfXj + Yli a i,i,iXf- Since Xj are 
independent standard Gaussian random variables, with probability at least 1 — C*nexp(— C’n 2e ^) 
(union bound), \Xf\ < n e / 3 for all i 6 [n]. Conditioned on this high probability event, the absolute 
value of the sum is bounded by: 


3 E' 


b 2,2,J 


Xf Xj + 


2 ^ 

i 


i ■ ■ ■ Y 6 
H,1,1^1 


ij=l 


E 31| j )l<j j<n || 1?1 
E 3\/n || J )l<i ,j<n || 
E 3||a|| F n a5+ E 


By Theorem 


G.21 


we know that with probability at least 1 
value of the sum of the “non-diagonal” terms is bounded by ||a 
the proof by applying the union bound. 


— Cexp (—C"n 2e / 3 ), the absolute 
| F nE Therefore we can conclude 

□ 
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